path: root/include/linux
Age | Commit message | Author | Files | Lines
2023-10-04 | introduce __next_thread(), fix next_tid() vs exec() race | Oleg Nesterov | 1 | -0/+11
Patch series "introduce __next_thread(), change next_thread()".

After commit dce8f8ed1de1 ("document while_each_thread(), change first_tid() to use for_each_thread()") + this series:

1. We have only one lockless user of next_thread(), task_group_seq_get_next(). I think it should be changed too.

2. We have only one user of task_struct->thread_group, thread_group_empty(). The next patches will change thread_group_empty() and kill ->thread_group.

This patch (of 2):

next_tid(start) does:

	rcu_read_lock();
	if (pid_alive(start)) {
		pos = next_thread(start);
		if (thread_group_leader(pos))
			pos = NULL;
		else
			get_task_struct(pos);

It should return pos = NULL when next_thread() wraps to the first thread in the thread group, the group leader, and the thread_group_leader() check tries to detect this case. But this can race with exec. To simplify, suppose we have a main thread M and a single sub-thread T; next_tid(T) should return NULL. Now suppose that T execs. If next_tid(T) is called after T takes over the leadership and before it does release_task(), which removes the old leader from the list, then next_thread() returns M while thread_group_leader(M) is false, so next_tid() wrongly returns M instead of NULL.

Lockless use of next_thread() should be avoided. After this change only task_group_seq_get_next() does this, and I believe it should be changed as well.

Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
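For reference, a sketch of the new helper consistent with the description above: an RCU list walk that yields NULL at the end of the thread list instead of wrapping back to the group leader (the exact upstream body may differ slightly):

```c
/* sketch of the helper added to include/linux/sched/signal.h */
static inline struct task_struct *__next_thread(struct task_struct *p)
{
	/* returns NULL at the end of the list instead of wrapping around */
	return list_next_or_null_rcu(&p->signal->thread_head,
				     &p->thread_node,
				     struct task_struct,
				     thread_node);
}
```

With this, next_tid() can simply stop at NULL, with no thread_group_leader() check that can race with exec.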
2023-10-04 | compiler.h: unify __UNIQUE_ID | Nick Desaulniers | 3 | -11/+1
commit 6f33d58794ef ("__UNIQUE_ID()") added a fallback definition of __UNIQUE_ID because gcc 4.2 and older did not support __COUNTER__. Also, this commit is effectively a revert of commit b41c29b0527c ("Kbuild: provide a __UNIQUE_ID for clang"), which mentions clang 2.6+ supporting __COUNTER__. Documentation/process/changes.rst currently lists the minimum supported version of these compilers as:

- gcc: 5.1
- clang: 11.0.0

It should be safe to say that __COUNTER__ is well supported by this point.

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nick Desaulniers <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Jan Beulich <[email protected]> Cc: Luc Van Oostenryck <[email protected]> Cc: Michal Marek <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Paul Russell <[email protected]> Cc: Tom Rix <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
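The surviving definition is the __COUNTER__-based one; a sketch (modulo exact file placement in compiler.h / compiler_types.h):

```c
/* paste twice so that macro arguments are expanded before concatenation */
#define ___PASTE(a, b)		a##b
#define __PASTE(a, b)		___PASTE(a, b)

/* "int __UNIQUE_ID(foo);" expands to something like "int __UNIQUE_ID_foo42;" */
#define __UNIQUE_ID(prefix)	__PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)
```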
2023-10-04 | mm/damon/core: implement scheme-specific apply interval | SeongJae Park | 1 | -2/+15
DAMON-based operation schemes are applied at every aggregation interval. That was mainly because schemes were using nr_accesses, which is only complete and ready to use at every aggregation interval. However, the schemes are now using nr_accesses_bp, which is updated at each sampling interval in a way that makes it reasonable to use. Therefore, there is no reason to apply schemes only at each aggregation interval.

The unnecessary alignment with the aggregation interval was also making some use cases of DAMOS tricky. Quota setting under a long aggregation interval is one such example. Suppose the aggregation interval is ten seconds, and there is a scheme having a CPU quota of 100ms per 1s. The scheme will actually use 100ms per ten seconds, since it cannot be applied again before the next aggregation interval. The feature is working as intended, but the result might not be that intuitive for some users. This could be fixed by updating the quota to 1s per 10s. But in that case, the CPU usage of DAMOS could look like spikes, and could actually hurt other CPU-sensitive workloads.

Implement a dedicated timing interval for each DAMON-based operation scheme, namely apply_interval. The interval will be aligned to the sampling interval, and each scheme will be applied at its own apply_interval. The interval is set to 0 by default, which means the scheme should use the aggregation interval instead; existing users therefore see no behavioral difference.

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
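A sketch of how the new knob could sit in struct damos, assuming field names that follow the description (the actual layout in include/linux/damon.h may differ):

```c
struct damos {
	/* ... existing fields: access pattern, action, quota, watermarks ... */

	/*
	 * Apply the scheme once per this many microseconds;
	 * 0 means "fall back to the aggregation interval".
	 */
	unsigned long apply_interval_us;

	/* private: sampling intervals to wait until the next apply */
	unsigned long next_apply_sis;
};
```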
2023-10-04 | memblock: introduce MEMBLOCK_RSRV_NOINIT flag | Usama Arif | 1 | -0/+9
For reserved memory regions marked with this flag, reserve_bootmem_region() is not called during memmap_init_reserved_pages(). This can be used to avoid struct page initialization for regions that won't need it, e.g. hugetlb pages with Hugepage Vmemmap Optimization enabled. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Usama Arif <[email protected]> Acked-by: Muchun Song <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Punit Agrawal <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
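A minimal usage sketch, assuming the flag is set via the memblock_reserved_mark_noinit() helper that accompanies it:

```c
#include <linux/memblock.h>

static void __init reserve_without_pageinit(phys_addr_t base, phys_addr_t size)
{
	memblock_reserve(base, size);
	/*
	 * Skip struct page init for this range in memmap_init_reserved_pages();
	 * the consumer (e.g. HVO) initializes only what it actually needs.
	 */
	memblock_reserved_mark_noinit(base, size);
}
```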
2023-10-04 | mm/damon/core: mark damon_moving_sum() as a static function | SeongJae Park | 1 | -2/+0
The function is used only by mm/damon/core.c. Mark it as a static function. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04 | mm/damon/core: use pseudo-moving sum for nr_accesses_bp | SeongJae Park | 1 | -3/+9
Let nr_accesses_bp be calculated as a pseudo-moving sum that is updated at every sampling interval, using damon_moving_sum(). This is assumed to be useful for cases where the aggregation interval is set quite large, but the monitoring results need to be collected before the next aggregation interval passes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04 | mm/damon/core: introduce nr_accesses_bp | SeongJae Park | 1 | -0/+5
Add yet another representation of the access rate of each region, namely nr_accesses_bp. It is the same as nr_accesses but represents the value in basis points (1 in 10,000), and is updated at once in every aggregation interval. That is, nr_accesses_bp is just nr_accesses * 10000. This may seem useless at the moment. However, it will be useful for representing sub-1 nr_accesses values, which will be needed to make nr_accesses moving sum-based. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04 | mm/damon/core: implement a pseudo-moving sum function | SeongJae Park | 1 | -0/+2
For values that continuously change, a moving average or sum is a good way to provide fast updates while handling temporal and erroneous variability of the value. For example, the access rate counter (nr_accesses) is calculated as a sum of the number of positive sampled access check results collected during a discrete time window (the aggregation interval); hence it handles temporal and erroneous access check results, but provides an update only once per aggregation interval. Using a moving sum method for it could allow providing the value at every sampling interval. That could be useful for getting monitoring results snapshots or running DAMOS with fine-grained timing.

However, supporting a moving sum for cases where the number of samples in the time window is arbitrary could impose high overhead, since the number of past values that need to be kept could be too high. nr_accesses would be one such case. To mitigate the overhead, implement a pseudo-moving sum function that only provides an estimated pseudo-moving sum. It assumes there was no error in the last discrete time window and subtracts a constant portion of the last window's sum.

Note that the function is not strictly a moving sum, but it keeps an important property of one: the value is the same as the discrete-window based sum at each window-aligned timing. Hence, people collecting the value at the old timings will see no difference.

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
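A sketch of such a helper, consistent with the description (the upstream damon_moving_sum() should be essentially this): retire an even share of the last window's sum per sample, then add the new sample.

```c
/*
 * mvsum:      current pseudo-moving sum
 * nomvsum:    the last discrete-window (non-moving) sum
 * len_window: number of samples per time window
 * new_value:  the sample collected in this sampling interval
 */
unsigned int damon_moving_sum(unsigned int mvsum, unsigned int nomvsum,
			      unsigned int len_window, unsigned int new_value)
{
	/* assume the last window was error-free: drop 1/len_window of it */
	return mvsum - nomvsum / len_window + new_value;
}
```

After exactly len_window updates, the retired portions add up to the whole of nomvsum, which is why the result coincides with the discrete-window sum at window-aligned timings.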
2023-10-04 | mm/damon/core: define and use a dedicated function for region access rate update | SeongJae Park | 1 | -1/+4
Patch series "mm/damon: provide pseudo-moving sum based access rate".

DAMON checks the access to each region at every sampling interval and increases the access rate counter of the region, namely nr_accesses, if an access was made. At every aggregation interval, the counter is reset. The counter is exposed to users to be used as a metric showing the relative access rate (frequency) of each region. In other words, DAMON provides the access rate of each region at every aggregation interval. The aggregation avoids temporal access pattern changes making things confusing. However, it also forces a few DAMON-related operations to be unnecessarily aligned to the aggregation interval. This can restrict the flexibility of DAMON applications, especially when the aggregation interval is huge.

To provide the monitoring results with finer-grained timing while still handling temporal access pattern changes, this patchset implements a pseudo-moving sum based access rate metric. It is a pseudo-moving sum because a strict moving sum implementation would need to keep all values for the last time window, and that could incur high overhead, as there could be an arbitrary number of values in a time window. Especially in the case of nr_accesses, since the sampling interval and aggregation interval can be arbitrarily set and the past values should be maintained for every region, it could be risky. The pseudo-moving sum assumes there were no temporal access pattern changes in the last discrete time window, to remove the need for keeping the list of the last time window's values. As a result, it is not a strict moving sum implementation, but it provides reasonable accuracy. It also keeps an important property of the moving sum: the moving sum becomes the same as the discrete-window based sum at times aligned to the time window. This means that using the pseudo-moving sum based nr_accesses makes no change for users who read the value at every aggregation interval.

Patches Sequence
----------------

The first four patches are for preparation of the change. The first two (patches 1 and 2) implement a helper function for nr_accesses updates and eliminate a corner case that skips use of the function, respectively. The following two (patches 3 and 4) respectively implement the pseudo-moving sum function and its simple unit test case.

Two patches for making DAMON use the pseudo-moving sum follow. The fifth one (patch 5) introduces a new field for representing the pseudo-moving sum-based access rate of each region, and the sixth one makes the new representation actually be updated with the pseudo-moving sum function.

The last two patches (patches 7 and 8) make follow-up fixes for skipping unnecessary updates and marking the moving sum function as static, respectively.

This patch (of 8):

Each DAMON operations set updates the nr_accesses field of each damon_region for each of its access check results, from the check_accesses() callback. Directly accessing the field could make things complex to manage and change in the future. Define and use a dedicated function for the purpose.

Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
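The helper is tiny; a sketch consistent with this first patch (later patches in the series fold the moving-sum update into it):

```c
/* update the access rate counter of @r after one access check result */
void damon_update_region_access_rate(struct damon_region *r, bool accessed)
{
	if (accessed)
		r->nr_accesses++;
}
```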
2023-10-04 | mm/damon/core: use number of passed access sampling as a timer | SeongJae Park | 1 | -2/+12
DAMON sleeps for the sampling interval after each sampling, and checks whether the aggregation interval and the ops update interval have passed using ktime_get_coarse_ts64() and baseline timestamps for the intervals. That design is for making the operations occur at deterministic timing regardless of the time spent on each unit of work. However, it turned out it is not that useful, and it incurs not-that-intuitive results.

After all, timer functions, and especially the sleep functions that DAMON uses to wait for specific timings, are not necessarily strictly accurate. That is a legitimate design, so no problem in itself. However, because of such inaccuracies, nr_accesses can be larger than the aggregation interval divided by the sampling interval. For example, with the default setting (5 ms sampling interval and 100 ms aggregation interval) we frequently see regions having nr_accesses larger than 20. Also, if the execution of a DAMOS scheme takes a long time, the next aggregation could happen before enough samples are collected. This is not what usual users would intuitively expect.

Since access check sampling is the smallest unit of work in DAMON, using the number of passed sampling intervals as the DAMON-internal timer can easily avoid these problems. That is, convert the aggregation and ops update intervals to the numbers of sampling intervals that need to pass before those operations are executed, count the number of passed sampling intervals, and invoke the operations as soon as the specific number of sampling intervals has passed. Make the change.

Note that this could make a behavioral change for settings using intervals that are not aligned to the sampling interval. For example, if the sampling interval is 5 ms and the aggregation interval is 12 ms, DAMON effectively used 15 ms as its aggregation interval, because it checked whether the aggregation interval had passed only after sleeping for the sampling interval. This change will make DAMON effectively use 10 ms as the aggregation interval, since it uses 'aggregation interval / sampling interval * sampling interval' as the effective aggregation interval, and we don't use floating point types. Usual users would have used aligned intervals, so this behavioral change is not expected to have any meaningful impact; just make this change.

Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
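The effective-interval arithmetic is plain integer division; a hypothetical helper (name and signature are illustrative, not from the patch) makes the truncation explicit:

```c
/*
 * How many whole sampling intervals fit in @interval_us;
 * e.g. 12000us aggregation / 5000us sampling -> 2 samples (10 ms effective).
 */
static unsigned long damon_interval_to_nr_samples(unsigned long interval_us,
						  unsigned long sample_us)
{
	return interval_us / sample_us;
}
```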
2023-10-04 | mm, vmscan: remove ISOLATE_UNMAPPED | Vlastimil Babka | 1 | -2/+0
This isolate_mode_t flag is effectively unused since 89f6c88a6ab4 ("mm: __isolate_lru_page_prepare() in isolate_migratepages_block()") as sc->may_unmap is now checked directly (and only node_reclaim has a mode that sets it to 0). The last remaining place is mm_vmscan_lru_isolate tracepoint for the isolate_mode parameter. That one was mainly used to indicate the active/inactive mode, which the trace-vmscan-postprocess.pl script consumed, but that got silently broken. After fixing the script by the previous patch, it does not need the isolate_mode anymore. So just remove the parameter and with that the whole ISOLATE_UNMAPPED flag. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04 | buffer: remove __getblk_gfp() | Matthew Wilcox (Oracle) | 1 | -2/+0
Inline it into __bread_gfp(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Hui Zhu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04 | ext4: call bdev_getblk() from sb_getblk_gfp() | Matthew Wilcox (Oracle) | 1 | -3/+3
Most of the callers of sb_getblk_gfp() already assumed that they were passing the entire GFP flags to use. Fix up the two callers that didn't, and remove the __GFP_NOFAIL from them since they both appear to correctly handle failure. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Hui Zhu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04 | buffer: convert sb_getblk() to call __getblk() | Matthew Wilcox (Oracle) | 1 | -4/+3
Now that __getblk() is in the right place in the file, it is trivial to call it from sb_getblk(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Hui Zhu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04 | buffer: convert getblk_unmovable() and __getblk() to use bdev_getblk() | Matthew Wilcox (Oracle) | 1 | -14/+22
Move these two functions up in the file for the benefit of the next patch, and pass in all of the GFP flags to use instead of the partial GFP flags used by __getblk_gfp(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Hui Zhu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04 | buffer: hoist GFP flags from grow_dev_page() to __getblk_gfp() | Matthew Wilcox (Oracle) | 1 | -0/+2
grow_dev_page() is only called by grow_buffers(). grow_buffers() is only called by __getblk_slow() and __getblk_slow() is only called from __getblk_gfp(), so it is safe to move the GFP flags setting all the way up. With that done, add a new bdev_getblk() entry point that leaves the GFP flags the way the caller specified them. [[email protected]: fix grow_dev_page() error handling] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Hui Zhu <[email protected]> Cc: Dan Carpenter <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
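The shape of the new entry point, per the description (a sketch; the definitive prototype lands in the buffer_head headers with this series): the caller passes the complete GFP mask, and nothing is OR'd in behind its back.

```c
struct buffer_head *bdev_getblk(struct block_device *bdev, sector_t block,
				unsigned size, gfp_t gfp);
```

An atomic-context caller can then pass e.g. GFP_NOWAIT | __GFP_MOVABLE and handle a NULL return, instead of inheriting __GFP_NOFAIL behavior it never asked for.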
2023-10-04 | buffer: pass GFP flags to folio_alloc_buffers() | Matthew Wilcox (Oracle) | 1 | -1/+1
Patch series "Add and use bdev_getblk()", v2. This patch series fixes a bug reported by Hui Zhu; see proposed patches v1 and v2: https://lore.kernel.org/linux-fsdevel/[email protected]/ https://lore.kernel.org/linux-fsdevel/[email protected]/ I decided to go in a rather different direction for this fix, and fix a related problem at the same time. I don't think there's any urgency to rush this into Linus' tree, nor have I marked it for stable. Reasonable people may disagree. This patch (of 8): Instead of creating entirely new flags, inherit them from grow_dev_page(). The other callers create the same flags that this function used to create. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Hui Zhu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04 | mm: migrate: convert migrate_misplaced_page() to migrate_misplaced_folio() | Kefeng Wang | 1 | -2/+2
At present, NUMA balancing only supports base pages and PMD-mapped THP, but we will expand it to support migrating large folios / PTE-mapped THP in the future. It is better to make migrate_misplaced_page() take a folio instead of a page, and rename it to migrate_misplaced_folio(). This is a preparation that also removes several compound_head() calls. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Reviewed-by: Zi Yan <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
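A hedged sketch of the renamed interface (argument list follows the description; the exact upstream prototype may differ):

```c
/* was: int migrate_misplaced_page(struct page *page, ...) */
int migrate_misplaced_folio(struct folio *folio, struct vm_area_struct *vma,
			    int node);
```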
2023-10-04 | mm/rmap: pass folio to hugepage_add_anon_rmap() | David Hildenbrand | 1 | -1/+1
Let's pass a folio; we are always mapping the entire thing. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04 | mm: shrinker: make global slab shrink lockless | Qi Zheng | 1 | -0/+24
The shrinker_rwsem is a global read-write lock in the shrinkers subsystem, which protects most operations such as slab shrink, registration and unregistration of shrinkers, etc. This can easily cause problems in the following cases.

1) When the memory pressure is high and there are many filesystems mounted or unmounted at the same time, slab shrink will be affected (down_read_trylock() failed). Such as the real workload mentioned by Kirill Tkhai:

```
One of the real workloads from my experience is start of an overcommitted node
containing many starting containers after node crash (or many resuming
containers after reboot for kernel update). In these cases memory pressure is
huge, and the node goes round in long reclaim.
```

2) If a shrinker is blocked (such as the case mentioned in [1]) and a writer comes in (such as mounting a fs), then this writer will be blocked and cause all subsequent shrinker-related operations to be blocked.

Even if there is no competitor when shrinking slab, there may still be a problem. The down_read_trylock() may become a perf hotspot with frequent calls to shrink_slab(). Because of the poor multicore scalability of atomic operations, this can lead to a significant drop in IPC (instructions per cycle).

We used to implement the lockless slab shrink with SRCU [2], but then the kernel test robot reported a -88.8% regression in the stress-ng.ramfs.ops_per_sec test case [3], so we reverted it [4].

This commit uses the refcount+RCU method [5] proposed by Dave Chinner to re-implement the lockless global slab shrink. The memcg slab shrink is handled in the subsequent patch. For now, all shrinker instances are converted to dynamically allocated and will be freed by call_rcu(). So we can use rcu_read_{lock,unlock}() to ensure that the shrinker instance is valid, and the shrinker instance will not be run again after unregistration. So the structure that records the pointer of the shrinker instance can be safely freed without waiting for the RCU read-side critical section. In this way, while we implement the lockless slab shrink, we don't need to be blocked in unregister_shrinker().

The following are the test results:

stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &

1) Before applying this patchset:

 setting to a 60 second run per stressor
 dispatching hogs: 9 ramfs
 stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
                          (secs)    (secs)    (secs)   (real time) (usr+sys time)
 ramfs            473062     60.00      8.00    279.13      7884.12        1647.59
 for a 60.01s run time:
   1440.34s available CPU time
      7.99s user time   (  0.55%)
    279.13s system time ( 19.38%)
    287.12s total time  ( 19.93%)
 load average: 7.12 2.99 1.15
 successful run completed in 60.01s (1 min, 0.01 secs)

2) After applying this patchset:

 setting to a 60 second run per stressor
 dispatching hogs: 9 ramfs
 stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
                          (secs)    (secs)    (secs)   (real time) (usr+sys time)
 ramfs            477165     60.00      8.13    281.34      7952.55        1648.40
 for a 60.01s run time:
   1440.33s available CPU time
      8.12s user time   (  0.56%)
    281.34s system time ( 19.53%)
    289.46s total time  ( 20.10%)
 load average: 6.98 3.03 1.19
 successful run completed in 60.01s (1 min, 0.01 secs)

We can see that the ops/s has hardly changed.

[1]. https://lore.kernel.org/lkml/[email protected]/
[2]. https://lore.kernel.org/lkml/[email protected]/
[3]. https://lore.kernel.org/lkml/[email protected]/
[4]. https://lore.kernel.org/all/[email protected]/
[5].
https://lore.kernel.org/lkml/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Qi Zheng <[email protected]> Cc: Abhinav Kumar <[email protected]> Cc: Alasdair Kergon <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alyssa Rosenzweig <[email protected]> Cc: Andreas Dilger <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Bob Peterson <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Carlos Llamas <[email protected]> Cc: Chandan Babu R <[email protected]> Cc: Chao Yu <[email protected]> Cc: Chris Mason <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christian Koenig <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Coly Li <[email protected]> Cc: Dai Ngo <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: "Darrick J. Wong" <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Airlie <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Sterba <[email protected]> Cc: Dmitry Baryshkov <[email protected]> Cc: Gao Xiang <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Huang Rui <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jani Nikula <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jason Wang <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jeffle Xu <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Marijn Suijten <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Mike Snitzer <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Muchun Song <[email protected]> Cc: Muchun Song <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Neil Brown <[email protected]> Cc: Oleksandr Tyshchenko <[email protected]> Cc: Olga Kornievskaia <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rob Clark <[email protected]> Cc: Rob Herring <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Sean Paul <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Song Liu <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Steven Price <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tomeu Vizoso <[email protected]> Cc: Tom Talpey <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Xuan Zhuo <[email protected]> Cc: Yue Hu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
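A sketch of the resulting walk, modeled on the description above (the real loop in shrink_slab() differs in details): the RCU read lock pins the list node, a per-shrinker refcount pins the shrinker across the actual scan, and the final put after unregistration triggers the RCU-deferred free.

```c
/* sketch: lockless iteration over the global shrinker list */
rcu_read_lock();
list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
	/* skip shrinkers that are being unregistered or are not yet live */
	if (!shrinker_try_get(shrinker))
		continue;
	rcu_read_unlock();	/* safe: the refcount now pins the shrinker */

	freed += do_shrink_slab(&sc, shrinker, priority);

	rcu_read_lock();
	shrinker_put(shrinker);	/* last put schedules the RCU-deferred free */
}
rcu_read_unlock();
```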
2023-10-04 | mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred} | Qi Zheng | 2 | -11/+18
Currently, we maintain two linear arrays per node per memcg, which are shrinker_info::map and shrinker_info::nr_deferred. And we need to resize them when shrinker_nr_max is exceeded, that is, allocate a new array, copy the old array to the new array, and finally free the old array by RCU.

For shrinker_info::map, we do set_bit() under the RCU lock, so we may set the value into the old map which is about to be freed. This may cause the value set to be lost. The current solution is not to copy the old map when resizing, but to set all the corresponding bits in the new map to 1. This solves the data loss problem, but brings the overhead of more pointless loops while doing memcg slab shrink.

For shrinker_info::nr_deferred, we will only modify it under the read lock of shrinker_rwsem, so it will not run concurrently with the resizing. But after we make memcg slab shrink lockless, there will be the same data loss problem as with shrinker_info::map, and we can't work around it like the map.

For such resizable arrays, the most straightforward idea is to change them to an xarray, like we did for list_lru [1]. We need to do xa_store() in the list_lru_add()-->set_shrinker_bit() path, but this will cause memory allocation, and list_lru_add() doesn't accept failure. A possible solution is to pre-allocate, but the location of the pre-allocation is not well determined (such as the deferred_split_shrinker case).

Therefore, this commit chooses to introduce the following secondary array for shrinker_info::{map, nr_deferred}:

 +---------------+--------+--------+-----+
 | shrinker_info | unit 0 | unit 1 | ... | (secondary array)
 +---------------+--------+--------+-----+
                     |
                     v
 +---------------+-----+
 | nr_deferred[] | map | (leaf array)
 +---------------+-----+
 (shrinker_info_unit)

The leaf array is never freed unless the memcg is destroyed. The secondary array will be resized every time the shrinker id exceeds shrinker_nr_max. So the shrinker_info_unit can be indexed from both the old and the new shrinker_info->unit[x]. Then even if we get the old secondary array under the RCU lock, the found map and nr_deferred remain valid, so the updated nr_deferred and map will not be lost.

[1]. https://lore.kernel.org/all/[email protected]/

[[email protected]: unlock the &shrinker_rwsem before the call to free_shrinker_info()] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Qi Zheng <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Abhinav Kumar <[email protected]> Cc: Alasdair Kergon <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alyssa Rosenzweig <[email protected]> Cc: Andreas Dilger <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Bob Peterson <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Carlos Llamas <[email protected]> Cc: Chandan Babu R <[email protected]> Cc: Chao Yu <[email protected]> Cc: Chris Mason <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christian Koenig <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Coly Li <[email protected]> Cc: Dai Ngo <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: "Darrick J.
Wong" <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Airlie <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Sterba <[email protected]> Cc: Dmitry Baryshkov <[email protected]> Cc: Gao Xiang <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Huang Rui <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jani Nikula <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jason Wang <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jeffle Xu <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Marijn Suijten <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Mike Snitzer <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Muchun Song <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Neil Brown <[email protected]> Cc: Oleksandr Tyshchenko <[email protected]> Cc: Olga Kornievskaia <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rob Clark <[email protected]> Cc: Rob Herring <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Sean Paul <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Song Liu <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Steven Price <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tomeu Vizoso <[email protected]> Cc: Tom Talpey <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Xuan Zhuo <[email protected]> Cc: Yue Hu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
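In C, the two levels sketched in the diagram come out roughly as follows (field names follow the description; see include/linux/shrinker.h for the authoritative version):

```c
struct shrinker_info_unit {
	DECLARE_BITMAP(map, SHRINKER_UNIT_BITS);
	atomic_long_t nr_deferred[SHRINKER_UNIT_BITS];
};

struct shrinker_info {
	struct rcu_head rcu;	/* old secondary arrays are freed via RCU */
	int map_nr_max;
	struct shrinker_info_unit *unit[];	/* resizable secondary array */
};
```

Because a resize replaces only the top-level unit[] array while the leaf units stay in place, a reader holding the old secondary array still reaches the same leaves the writer updates.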
2023-10-04 | mm: shrinker: remove old APIs | Qi Zheng | 1 | -8/+0
Now no users are using the old APIs, just remove them. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Qi Zheng <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Abhinav Kumar <[email protected]> Cc: Alasdair Kergon <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alyssa Rosenzweig <[email protected]> Cc: Andreas Dilger <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Bob Peterson <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Carlos Llamas <[email protected]> Cc: Chandan Babu R <[email protected]> Cc: Chao Yu <[email protected]> Cc: Chris Mason <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christian Koenig <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Coly Li <[email protected]> Cc: Dai Ngo <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: "Darrick J. Wong" <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Airlie <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Sterba <[email protected]> Cc: Dmitry Baryshkov <[email protected]> Cc: Gao Xiang <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Huang Rui <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jani Nikula <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jason Wang <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jeffle Xu <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Marijn Suijten <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Mike Snitzer <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Muchun Song <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Neil Brown <[email protected]> Cc: Oleksandr Tyshchenko <[email protected]> Cc: Olga Kornievskaia <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rob Clark <[email protected]> Cc: Rob Herring <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Sean Paul <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Song Liu <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Steven Price <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tomeu Vizoso <[email protected]> Cc: Tom Talpey <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Xuan Zhuo <[email protected]> Cc: Yue Hu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04 | fs: super: dynamically allocate the s_shrink | Qi Zheng | 1 | -1/+1
In preparation for implementing lockless slab shrink, use new APIs to dynamically allocate the s_shrink, so that it can be freed asynchronously via RCU. Then it doesn't need to wait for RCU read-side critical section when releasing the struct super_block. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Qi Zheng <[email protected]> Reviewed-by: Muchun Song <[email protected]> Acked-by: David Sterba <[email protected]> Cc: Chris Mason <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Abhinav Kumar <[email protected]> Cc: Alasdair Kergon <[email protected]> Cc: Alyssa Rosenzweig <[email protected]> Cc: Andreas Dilger <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Bob Peterson <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Carlos Llamas <[email protected]> Cc: Chandan Babu R <[email protected]> Cc: Chao Yu <[email protected]> Cc: Christian Koenig <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Coly Li <[email protected]> Cc: Dai Ngo <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: "Darrick J. Wong" <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Airlie <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Dmitry Baryshkov <[email protected]> Cc: Gao Xiang <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Huang Rui <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jani Nikula <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jason Wang <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jeffle Xu <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Marijn Suijten <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Mike Snitzer <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Muchun Song <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Neil Brown <[email protected]> Cc: Oleksandr Tyshchenko <[email protected]> Cc: Olga Kornievskaia <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rob Clark <[email protected]> Cc: Rob Herring <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Sean Paul <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Song Liu <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Steven Price <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tomeu Vizoso <[email protected]> Cc: Tom Talpey <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Xuan Zhuo <[email protected]> Cc: Yue Hu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
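Roughly what the alloc_super() conversion looks like under the new API (a sketch based on the series description, not a verbatim copy of the patch):

```c
s->s_shrink = shrinker_alloc(SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
			     "sb-%s", type->name);
if (!s->s_shrink)
	goto fail;

s->s_shrink->scan_objects = super_cache_scan;
s->s_shrink->count_objects = super_cache_count;
s->s_shrink->private_data = s;	/* recover the sb from the shrinker */

/* ... once the superblock is fully set up ... */
shrinker_register(s->s_shrink);
```

On teardown, shrinker_free(s->s_shrink) unregisters (if needed) and hands the instance to the RCU-deferred free, so releasing the super_block never waits for an RCU grace period.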
2023-10-04 | jbd2,ext4: dynamically allocate the jbd2-journal shrinker | Qi Zheng | 1 | -1/+1
In preparation for implementing lockless slab shrink, use new APIs to dynamically allocate the jbd2-journal shrinker, so that it can be freed asynchronously via RCU. Then it doesn't need to wait for RCU read-side critical section when releasing the struct journal_s. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Qi Zheng <[email protected]> Reviewed-by: Muchun Song <[email protected]> Acked-by: Jan Kara <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Abhinav Kumar <[email protected]> Cc: Alasdair Kergon <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alyssa Rosenzweig <[email protected]> Cc: Andreas Dilger <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Bob Peterson <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Carlos Llamas <[email protected]> Cc: Chandan Babu R <[email protected]> Cc: Chao Yu <[email protected]> Cc: Chris Mason <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christian Koenig <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Coly Li <[email protected]> Cc: Dai Ngo <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: "Darrick J. Wong" <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Airlie <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Sterba <[email protected]> Cc: Dmitry Baryshkov <[email protected]> Cc: Gao Xiang <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Huang Rui <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jani Nikula <[email protected]> Cc: Jason Wang <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jeffle Xu <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Marijn Suijten <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Mike Snitzer <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Muchun Song <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Neil Brown <[email protected]> Cc: Oleksandr Tyshchenko <[email protected]> Cc: Olga Kornievskaia <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rob Clark <[email protected]> Cc: Rob Herring <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Sean Paul <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Song Liu <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Steven Price <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tomeu Vizoso <[email protected]> Cc: Tom Talpey <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Xuan Zhuo <[email protected]> Cc: Yue Hu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04 | mm: shrinker: add infrastructure for dynamically allocating shrinker | Qi Zheng | 1 | -5/+14
Patch series "use refcount+RCU method to implement lockless slab shrink", v6.

1. Background
=============

We used to implement the lockless slab shrink with SRCU [1], but then the kernel test robot reported a -88.8% regression in the stress-ng.ramfs.ops_per_sec test case [2], so we reverted it [3]. This patch series aims to re-implement the lockless slab shrink using the refcount+RCU method proposed by Dave Chinner [4].

[1]. https://lore.kernel.org/lkml/[email protected]/
[2]. https://lore.kernel.org/lkml/[email protected]/
[3]. https://lore.kernel.org/all/[email protected]/
[4]. https://lore.kernel.org/lkml/[email protected]/

2. Implementation
=================

Currently, the shrinker instances can be divided into the following three types:

a) global shrinker instance statically defined in the kernel, such as workingset_shadow_shrinker.

b) global shrinker instance statically defined in the kernel modules, such as mmu_shrinker in x86.

c) shrinker instance embedded in other structures.

For case a, the memory of the shrinker instance is never freed. For case b, the memory of the shrinker instance will be freed after synchronize_rcu() when the module is unloaded. For case c, the memory of the shrinker instance will be freed along with the structure it is embedded in.

In preparation for implementing lockless slab shrink, we need to dynamically allocate those shrinker instances in case c; then the memory can be dynamically freed alone by calling kfree_rcu(). This patchset adds the following new APIs for dynamically allocating a shrinker, and adds a private_data field to struct shrinker to record and get the original embedded structure.

1. shrinker_alloc()
2. shrinker_register()
3. shrinker_free()

In order to simplify shrinker-related APIs and make shrinker more independent of other kernel mechanisms, this patchset uses the above APIs to convert all shrinkers (including cases a and b) to dynamically allocated, and then removes all existing APIs. This will also have another advantage mentioned by Dave Chinner:

```
The other advantage of this is that it will break all the existing out of tree
code and third party modules using the old API and will no longer work with a
kernel using lockless slab shrinkers. They need to break (both at the source
and binary levels) to stop bad things from happening due to using unconverted
shrinkers in the new setup.
```

Then we free the shrinker by calling call_rcu(), and use rcu_read_{lock,unlock}() to ensure that the shrinker instance is valid. And the shrinker::refcount mechanism ensures that the shrinker instance will not be run again after unregistration. So the structure that records the pointer of the shrinker instance can be safely freed without waiting for the RCU read-side critical section. In this way, while we implement the lockless slab shrink, we don't need to be blocked in unregister_shrinker() to wait for the RCU read-side critical section.

PATCH 1: introduce new APIs
PATCH 2~38: convert all shrinkers to use the new APIs
PATCH 39: remove old APIs
PATCH 40~41: some cleanups and preparations
PATCH 42-43: implement the lockless slab shrink
PATCH 44~45: convert shrinker_rwsem to mutex

3. Testing
==========

3.1 slab shrink stress test
---------------------------

We can reproduce the down_read_trylock() hotspot through the following script:

```
DIR="/root/shrinker/memcg/mnt"

do_create()
{
    mkdir -p /sys/fs/cgroup/memory/test
    echo 4G > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
    for i in `seq 0 $1`;
    do
        mkdir -p /sys/fs/cgroup/memory/test/$i;
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        mkdir -p $DIR/$i;
    done
}

do_mount()
{
    for i in `seq $1 $2`;
    do
        mount -t tmpfs $i $DIR/$i;
    done
}

do_touch()
{
    for i in `seq $1 $2`;
    do
        echo $$ > /sys/fs/cgroup/memory/test/$i/cgroup.procs;
        dd if=/dev/zero of=$DIR/$i/file$i bs=1M count=1 &
    done
}

case "$1" in
  touch)
    do_touch $2 $3
    ;;
  test)
    do_create 4000
    do_mount 0 4000
    do_touch 0 3000
    ;;
  *)
    exit 1
    ;;
esac
```

Save the above script, then run the test and touch commands. Then we can use the following perf command to view hotspots:

perf top -U -F 999

1) Before applying this patchset:

  33.15%  [kernel]        [k] down_read_trylock
  25.38%  [kernel]        [k] shrink_slab
  21.75%  [kernel]        [k] up_read
   4.45%  [kernel]        [k] _find_next_bit
   2.27%  [kernel]        [k] do_shrink_slab
   1.80%  [kernel]        [k] intel_idle_irq
   1.79%  [kernel]        [k] shrink_lruvec
   0.67%  [kernel]        [k] xas_descend
   0.41%  [kernel]        [k] mem_cgroup_iter
   0.40%  [kernel]        [k] shrink_node
   0.38%  [kernel]        [k] list_lru_count_one

2) After applying this patchset:

  64.56%  [kernel]        [k] shrink_slab
  12.18%  [kernel]        [k] do_shrink_slab
   3.30%  [kernel]        [k] __rcu_read_unlock
   2.61%  [kernel]        [k] shrink_lruvec
   2.49%  [kernel]        [k] __rcu_read_lock
   1.93%  [kernel]        [k] intel_idle_irq
   0.89%  [kernel]        [k] shrink_node
   0.81%  [kernel]        [k] mem_cgroup_iter
   0.77%  [kernel]        [k] mem_cgroup_calculate_protection
   0.66%  [kernel]        [k] list_lru_count_one

We can see that the first perf hotspot becomes shrink_slab, which is what we expect.

3.2 registration and unregistration stress test
-----------------------------------------------

Run the command below to test:

stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &

1) Before applying this patchset:

 setting to a 60 second run per stressor
 dispatching hogs: 9 ramfs
 stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
                          (secs)    (secs)    (secs)   (real time) (usr+sys time)
 ramfs            473062     60.00      8.00    279.13      7884.12        1647.59
 for a 60.01s run time:
   1440.34s available CPU time
      7.99s user time   (  0.55%)
    279.13s system time ( 19.38%)
    287.12s total time  ( 19.93%)
 load average: 7.12 2.99 1.15
 successful run completed in 60.01s (1 min, 0.01 secs)

2) After applying this patchset:

 setting to a 60 second run per stressor
 dispatching hogs: 9 ramfs
 stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s
                          (secs)    (secs)    (secs)   (real time) (usr+sys time)
 ramfs            477165     60.00      8.13    281.34      7952.55        1648.40
 for a 60.01s run time:
   1440.33s available CPU time
      8.12s user time   (  0.56%)
    281.34s system time ( 19.53%)
    289.46s total time  ( 20.10%)
 load average: 6.98 3.03 1.19
 successful run completed in 60.01s (1 min, 0.01 secs)

We can see that the ops/s has hardly changed.

This patch (of 45):

Currently, the shrinker instances can be divided into the following three types:

a) global shrinker instance statically defined in the kernel, such as workingset_shadow_shrinker.

b) global shrinker instance statically defined in the kernel modules, such as mmu_shrinker in x86.

c) shrinker instance embedded in other structures.

For case a, the memory of the shrinker instance is never freed. For case b, the memory of the shrinker instance will be freed after synchronize_rcu() when the module is unloaded.
For case c, the memory of the shrinker instance will be freed along with the structure it is embedded in.

In preparation for implementing lockless slab shrink, we need to dynamically allocate those shrinker instances in case c; then the memory can be dynamically freed alone by calling kfree_rcu(). So this commit adds the following new APIs for dynamically allocating a shrinker, and adds a private_data field to struct shrinker to record and get the original embedded structure.

1. shrinker_alloc()

Used to allocate the shrinker instance itself and related memory; it will return a pointer to the shrinker instance on success and NULL on failure.

2. shrinker_register()

Used to register the shrinker instance, which is the same as the current register_shrinker_prepared().

3. shrinker_free()

Used to unregister (if needed) and free the shrinker instance.

In order to simplify shrinker-related APIs and make shrinker more independent of other kernel mechanisms, subsequent submissions will use the above APIs to convert all shrinkers (including cases a and b) to dynamically allocated, and then remove all existing APIs. This will also have another advantage mentioned by Dave Chinner:

```
The other advantage of this is that it will break all the existing out of tree
code and third party modules using the old API and will no longer work with a
kernel using lockless slab shrinkers. They need to break (both at the source
and binary levels) to stop bad things from happening due to using unconverted
shrinkers in the new setup.
```

[[email protected]: mm: shrinker: some cleanup] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Qi Zheng <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Paul E.
McKenney <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Steven Price <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Abhinav Kumar <[email protected]> Cc: Alasdair Kergon <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alyssa Rosenzweig <[email protected]> Cc: Andreas Dilger <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Bob Peterson <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Carlos Llamas <[email protected]> Cc: Chandan Babu R <[email protected]> Cc: Chao Yu <[email protected]> Cc: Chris Mason <[email protected]> Cc: Christian Koenig <[email protected]> Cc: Coly Li <[email protected]> Cc: Dai Ngo <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Airlie <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Sterba <[email protected]> Cc: Dmitry Baryshkov <[email protected]> Cc: Gao Xiang <[email protected]> Cc: Huang Rui <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jani Nikula <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jason Wang <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jeffle Xu <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Marijn Suijten <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Mike Snitzer <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Neil Brown <[email protected]> Cc: Oleksandr Tyshchenko <[email protected]> Cc: Olga Kornievskaia <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rob Clark <[email protected]> Cc: Rob Herring <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: Sean Paul <[email protected]> Cc: Song Liu <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tomeu Vizoso <[email protected]> Cc: Tom Talpey <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Cc: Xuan Zhuo <[email protected]> Cc: Yue Hu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
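The three entry points, as named by the series (prototype sketch; the canonical declarations land in include/linux/shrinker.h):

```c
/*
 * Allocate a shrinker plus its per-node/per-memcg bookkeeping;
 * the printf-style name feeds shrinker debugfs.
 */
struct shrinker *shrinker_alloc(unsigned int flags, const char *fmt, ...);

/* make the shrinker live; replaces register_shrinker_prepared() */
void shrinker_register(struct shrinker *shrinker);

/* unregister (if needed) and free; the instance is reclaimed via RCU */
void shrinker_free(struct shrinker *shrinker);
```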
2023-10-04 | drm/ttm: introduce pool_shrink_rwsem | Qi Zheng | 1 | -1/+0
Currently, synchronize_shrinkers() is only used by TTM pool. It only requires that no shrinkers run in parallel. After we use RCU+refcount method to implement the lockless slab shrink, we can not use shrinker_rwsem or synchronize_rcu() to guarantee that all shrinker invocations have seen an update before freeing memory. So we introduce a new pool_shrink_rwsem to implement a private ttm_pool_synchronize_shrinkers(), so as to achieve the same purpose. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Qi Zheng <[email protected]> Reviewed-by: Muchun Song <[email protected]> Reviewed-by: Christian König <[email protected]> Acked-by: Daniel Vetter <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Steven Price <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Abhinav Kumar <[email protected]> Cc: Alasdair Kergon <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alyssa Rosenzweig <[email protected]> Cc: Andreas Dilger <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Bob Peterson <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Carlos Llamas <[email protected]> Cc: Chandan Babu R <[email protected]> Cc: Chao Yu <[email protected]> Cc: Chris Mason <[email protected]> Cc: Coly Li <[email protected]> Cc: Dai Ngo <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Airlie <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Sterba <[email protected]> Cc: Dmitry Baryshkov <[email protected]> Cc: Gao Xiang <[email protected]> Cc: Huang Rui <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jani Nikula <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jason Wang <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jeffle Xu <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Marijn Suijten <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Mike Snitzer <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Muchun Song <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Neil Brown <[email protected]> Cc: Oleksandr Tyshchenko <[email protected]> Cc: Olga Kornievskaia <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rob Clark <[email protected]> Cc: Rob Herring <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: Sean Paul <[email protected]> Cc: Song Liu <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tomeu Vizoso <[email protected]> Cc: Tom Talpey <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Cc: Xuan Zhuo <[email protected]> Cc: Yue Hu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
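The replacement helper is tiny; a sketch consistent with the description: each pool shrinker invocation holds the rwsem in read mode, so a writer drains all in-flight invocations simply by taking it in write mode.

```c
static DECLARE_RWSEM(pool_shrink_rwsem);

/* wait until all currently running TTM pool shrinker invocations finish */
static void ttm_pool_synchronize_shrinkers(void)
{
	down_write(&pool_shrink_rwsem);
	up_write(&pool_shrink_rwsem);
}
```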
2023-10-04 | mm: move some shrinker-related function declarations to mm/internal.h | Qi Zheng | 1 | -19/+0
Patch series "cleanups for lockless slab shrink", v4. This series is some cleanups for lockless slab shrink. This patch (of 4): The following functions are only used inside the mm subsystem, so it's better to move their declarations to the mm/internal.h file. 1. shrinker_debugfs_add() 2. shrinker_debugfs_detach() 3. shrinker_debugfs_remove() Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Qi Zheng <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christian König <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Joel Fernandes <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Steven Price <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Abhinav Kumar <[email protected]> Cc: Alasdair Kergon <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alyssa Rosenzweig <[email protected]> Cc: Andreas Dilger <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Bob Peterson <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Carlos Llamas <[email protected]> Cc: Chandan Babu R <[email protected]> Cc: Chao Yu <[email protected]> Cc: Chris Mason <[email protected]> Cc: Coly Li <[email protected]> Cc: Dai Ngo <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Airlie <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Sterba <[email protected]> Cc: Dmitry Baryshkov <[email protected]> Cc: Gao Xiang <[email protected]> Cc: Huang Rui <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jani Nikula <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jason Wang <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jeffle Xu <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Marijn Suijten <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Mike Snitzer <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Muchun Song <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Neil Brown <[email protected]> Cc: Oleksandr Tyshchenko <[email protected]> Cc: Olga Kornievskaia <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rob Clark <[email protected]> Cc: Rob Herring <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: Sean Paul <[email protected]> Cc: Song Liu <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tomeu Vizoso <[email protected]> Cc: Tom Talpey <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Cc: Xuan Zhuo <[email protected]> Cc: Yue Hu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04efi/unaccepted: do not let /proc/vmcore try to access unaccepted memoryAdrian Hunter1-0/+7
Patch series "Do not try to access unaccepted memory", v2. Support for unaccepted memory was added recently, refer commit dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby a virtual machine may need to accept memory before it can be used. Plug a few gaps where RAM is exposed without checking if it is unaccepted memory. This patch (of 2): Support for unaccepted memory was added recently, refer commit dcdfdd40fa82 ("mm: Add support for unaccepted memory"), whereby a virtual machine may need to accept memory before it can be used. Do not let /proc/vmcore try to access unaccepted memory because it can cause the guest to fail. For /proc/vmcore, which is read-only, this means a read or mmap of unaccepted memory will return zeros. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Adrian Hunter <[email protected]> Cc: Ard Biesheuvel <[email protected]> Cc: Baoquan He <[email protected]> Cc: Borislav Petkov (AMD) <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Dave Young <[email protected]> Cc: Kirill A. Shutemov <[email protected]> Cc: Lorenzo Stoakes <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Tom Lendacky <[email protected]> Cc: Vivek Goyal <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04mm/damon/core: remove duplicated comment for watermarks-based deactivationSeongJae Park1-3/+0
The comment explaining the watermarks-based deactivation of the monitoring part is duplicated in two paragraphs. Remove one of them. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04mm/damon/core: add more comments for nr_accessesSeongJae Park1-7/+12
The comment on struct damon_region about the nr_accesses field is not sufficient; people have repeatedly asked what nr_accesses means. There is a more detailed explanation of the mechanism in the comment for struct damon_attrs, but it is also ambiguous, as it doesn't name the counter used for aggregating the access check results. Make both comments more detailed. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04mm/mremap: allow moves within the same VMA for stack movesJoel Fernandes (Google)1-1/+1
For the stack move happening in shift_arg_pages(), the move happens within the same VMA, which spans the old and new ranges. If the aligned address happens to fall within that VMA, allow such moves and don't abort the mremap alignment optimization. In the regular non-stack mremap case, we cannot allow any such moves, as they would end up destroying some part of the mapping (either the source of the move, or part of the existing mapping). So allow the in-VMA realignment only for stack moves. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Joel Fernandes (Google) <[email protected]> Reviewed-by: Lorenzo Stoakes <[email protected]> Cc: Kalesh Singh <[email protected]> Cc: "Kirill A. Shutemov" <[email protected]> Cc: Liam R. Howlett <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Lokesh Gidra <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
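A sketch of the kind of check the changelog describes, assuming a for_stack flag threaded through the alignment helper (the helper name and details are assumptions, not the verbatim patch):

    /* Sketch: may the move be realigned down to addr_to_align & mask? */
    static bool can_align_down(struct vm_area_struct *vma,
                               unsigned long addr_to_align,
                               unsigned long mask, bool for_stack)
    {
            unsigned long addr_masked = addr_to_align & mask;

            /* A non-stack move must start exactly at the VMA start, or
             * realigning would overwrite part of the existing mapping. */
            if (!for_stack && vma->vm_start != addr_to_align)
                    return false;

            /* A stack move stays within the single VMA that spans the
             * old and new ranges, so it is safe. */
            if (for_stack && addr_masked >= vma->vm_start)
                    return true;

            /* Otherwise the realigned start must not hit another VMA. */
            return find_vma_intersection(vma->vm_mm, addr_masked,
                                         vma->vm_start) == NULL;
    }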
2023-10-04mm: convert DAX lock/unlock page to lock/unlock folioMatthew Wilcox (Oracle)1-5/+5
The one caller of DAX lock/unlock page already calls compound_head(), so use page_folio() instead, then use a folio throughout the DAX code to remove uses of page->mapping and page->index. [[email protected]: add comment to mf_generic_kill_procs(), simplify mf_generic_kill_procs() folio initialization] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Jane Chu <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: Dan Williams <[email protected]> Cc: Jane Chu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
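On the caller side this presumably ends up looking something like the following (a sketch; the wrapper function here is hypothetical, only dax_lock_folio()/dax_unlock_folio() and the dax_entry_t cookie come from the described API):

    /* Sketch: hypothetical memory-failure caller using folio locking. */
    static int mf_dax_example(struct page *page)
    {
            struct folio *folio = page_folio(page);
            dax_entry_t cookie;

            cookie = dax_lock_folio(folio);   /* was dax_lock_page(page) */
            if (!cookie)
                    return -EBUSY;
            /* ... unmap and signal processes mapping the folio ... */
            dax_unlock_folio(folio, cookie);
            return 0;
    }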
2023-10-04mm: remove remnants of SPLIT_RSS_COUNTINGMateusz Guzik1-8/+0
The feature got retired in f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter"), but the patch failed to fully clean it up. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mateusz Guzik <[email protected]> Acked-by: Shakeel Butt <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04rcu: Assume rcu_report_dead() is always called locallyFrederic Weisbecker1-1/+1
rcu_report_dead() has to be called locally by the CPU that is going to exit the RCU state machine. Passing a cpu argument here is error-prone and leaves the possibility for a racy remote call. Use local access instead. Reviewed-by: Paul E. McKenney <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
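The shape of the interface change is presumably just this (sketch):

    /* Sketch: the dying CPU reports itself, so there is no cpu argument
     * left to misuse from a remote context. */
    void rcu_report_dead(void);
    /* was: void rcu_report_dead(unsigned int cpu); */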
2023-10-04cpuidle, ACPI: Evaluate LPI arch_flags for broadcast timerOza Pawandeep1-0/+9
Arm® Functional Fixed Hardware Specification defines LPI states, which provide an architectural context loss flags field that can be used to describe the context that might be lost when an LPI state is entered.

- Core context lost:
  - General purpose registers.
  - Floating point and SIMD registers.
  - System registers, including the System register based generic timer for the core.
  - Debug registers in the core power domain.
  - PMU registers in the core power domain.
  - Trace registers in the core power domain.
- Trace context loss.
- GICR.
- GICD.

Qualcomm's custom CPUs preserve the architectural state, including keeping the power domain for the local timers active, when a core is power-gated; the local timers are then sufficient to wake the core up without needing the broadcast timer. The patch fixes the evaluation of the cpuidle arch_flags and moves to the broadcast timer only if core context loss is declared in the ACPI LPI.

Fixes: a36a7fecfe60 ("ACPI / processor_idle: Add support for Low Power Idle(LPI) states") Reviewed-by: Sudeep Holla <[email protected]> Acked-by: Rafael J. Wysocki <[email protected]> Signed-off-by: Oza Pawandeep <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]>
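In code terms the fix is plausibly along these lines (a sketch; the flag name, its bit position, and the helper are assumptions, while CPUIDLE_FLAG_TIMER_STOP is the existing cpuidle flag):

    #define LPI_CTX_CORE_LOST   BIT(0)  /* assumed arch_flags encoding */

    /* Sketch: request the broadcast timer only when the LPI state says
     * core context, which includes the local generic timer, is lost. */
    static void lpi_update_state_flags(struct cpuidle_state *state,
                                       struct acpi_lpi_state *lpi)
    {
            if (lpi->arch_flags & LPI_CTX_CORE_LOST)
                    state->flags |= CPUIDLE_FLAG_TIMER_STOP;
    }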
2023-10-04Merge tag 'for-netdev' of ↵Jakub Kicinski1-1/+1
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2023-10-02

We've added 11 non-merge commits during the last 12 day(s) which contain a total of 12 files changed, 176 insertions(+), 41 deletions(-).

The main changes are:

1) Fix BPF verifier to reset backtrack_state masks on global function exit as otherwise subsequent precision tracking would reuse them, from Andrii Nakryiko.

2) Several sockmap fixes for available bytes accounting, from John Fastabend.

3) Reject sk_msg egress redirects to non-TCP sockets given this is only supported for TCP sockets today, from Jakub Sitnicki.

4) Fix a syzkaller splat in bpf_mprog when hitting maximum program limits with BPF_F_BEFORE directive, from Daniel Borkmann and Nikolay Aleksandrov.

5) Fix BPF memory allocator to use kmalloc_size_roundup() to adjust size_index for selecting a bpf_mem_cache, from Hou Tao.

6) Fix arch_prepare_bpf_trampoline return code for s390 JIT, from Song Liu.

7) Fix bpf_trampoline_get when CONFIG_BPF_JIT is turned off, from Leon Hwang.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Use kmalloc_size_roundup() to adjust size_index
  selftest/bpf: Add various selftests for program limits
  bpf, mprog: Fix maximum program check on mprog attachment
  bpf, sockmap: Reject sk_msg egress redirects to non-TCP sockets
  bpf, sockmap: Add tests for MSG_F_PEEK
  bpf, sockmap: Do not inc copied_seq when PEEK flag set
  bpf: tcp_read_skb needs to pop skb regardless of seq
  bpf: unconditionally reset backtrack_state masks on global func exit
  bpf: Fix tr dereferencing
  selftests/bpf: Check bpf_cubic_acked() is called via struct_ops
  s390/bpf: Let arch_prepare_bpf_trampoline return program size
====================

Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-10-04dmaengine: Remove unused declaration dma_chan_cleanup()Yue Haibing1-2/+0
Commit f27c580c3628 ("dmaengine: remove 'bigref' infrastructure") removed the implementation but left the declaration in place. Remove it. Signed-off-by: Yue Haibing <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Vinod Koul <[email protected]>
2023-10-04netfilter: handle the connecting collision properly in nf_conntrack_proto_sctpXin Long1-0/+1
In Scenarios A and B below, as the delayed INIT_ACK always changes the peer vtag, an SCTP ct with the incorrect vtag may cause packet loss.

Scenario A: INIT_ACK is delayed until the peer receives its own INIT_ACK

  192.168.1.2 > 192.168.1.1: [INIT] [init tag: 1328086772]
  192.168.1.1 > 192.168.1.2: [INIT] [init tag: 1414468151]
  192.168.1.2 > 192.168.1.1: [INIT ACK] [init tag: 1328086772]
  192.168.1.1 > 192.168.1.2: [INIT ACK] [init tag: 1650211246] *
  192.168.1.2 > 192.168.1.1: [COOKIE ECHO]
  192.168.1.1 > 192.168.1.2: [COOKIE ECHO]
  192.168.1.2 > 192.168.1.1: [COOKIE ACK]

Scenario B: INIT_ACK is delayed until the peer completes its own handshake

  192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
  192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
  192.168.1.2 > 192.168.1.1: sctp (1) [INIT ACK] [init tag: 3922216408]
  192.168.1.1 > 192.168.1.2: sctp (1) [COOKIE ECHO]
  192.168.1.2 > 192.168.1.1: sctp (1) [COOKIE ACK]
  192.168.1.1 > 192.168.1.2: sctp (1) [INIT ACK] [init tag: 3914796021] *

This patch fixes it as below:

In SCTP_CID_INIT processing:
- clear ct->proto.sctp.init[!dir] if ct->proto.sctp.init[dir] && ct->proto.sctp.init[!dir]. (Scenario E)
- set ct->proto.sctp.init[dir].

In SCTP_CID_INIT_ACK processing:
- drop it if !ct->proto.sctp.init[!dir] && ct->proto.sctp.vtag[!dir] && ct->proto.sctp.vtag[!dir] != ih->init_tag. (Scenario B, Scenario C)
- drop it if ct->proto.sctp.init[dir] && ct->proto.sctp.init[!dir] && ct->proto.sctp.vtag[!dir] != ih->init_tag. (Scenario A)

In SCTP_CID_COOKIE_ACK processing:
- clear ct->proto.sctp.init[dir] and ct->proto.sctp.init[!dir]. (Scenario D)

Also, it's important to allow the ct state to move forward with cookie_echo and cookie_ack from the opposite dir for the collision scenarios. There are also other scenarios where it should allow the packet through, addressed by the processing above:

Scenario C: a new ct is created by INIT_ACK.

Scenario D: INIT is started on the existing ESTABLISHED ct.

Scenario E: INIT is started after the old collision on the existing ESTABLISHED ct:

  192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 3922216408]
  192.168.1.1 > 192.168.1.2: sctp (1) [INIT] [init tag: 144230885]
  (both sides are stopped, then a new connection is started again in hours)
  192.168.1.2 > 192.168.1.1: sctp (1) [INIT] [init tag: 242308742]

Fixes: 9fb9cbb1082d ("[NETFILTER]: Add nf_conntrack subsystem.") Signed-off-by: Xin Long <[email protected]> Signed-off-by: Florian Westphal <[email protected]>
2023-10-04gpiolib: reluctantly provide gpio_device_get_chip()Bartosz Golaszewski1-0/+2
The process of converting all unauthorized users of struct gpio_chip to the dedicated struct gpio_device functions will be long, so in the meantime we must provide a way of retrieving the pointer to struct gpio_chip from a GPIO device. Signed-off-by: Bartosz Golaszewski <[email protected]> Reviewed-by: Linus Walleij <[email protected]>
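The accessor itself is presumably a one-liner; its contract is the interesting part (sketch):

    /* Transitional accessor: the returned chip is only valid while the
     * caller holds a reference to the owning gpio_device. */
    struct gpio_chip *gpio_device_get_chip(struct gpio_device *gdev);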
2023-10-04gpiolib: provide gpio_device_get_desc()Bartosz Golaszewski1-0/+2
Getting the GPIO descriptor directly from the gpio_chip struct is dangerous as we don't take the reference to the underlying GPIO device. In order to start working towards removing gpiochip_get_desc(), let's provide a safer variant that works with an existing reference to struct gpio_device. Signed-off-by: Bartosz Golaszewski <[email protected]> Reviewed-by: Linus Walleij <[email protected]>
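A sketch of the expected prototype:

    /* hwnum is the hardware offset of the line within the chip; the
     * caller must already hold a reference to gdev. */
    struct gpio_desc *gpio_device_get_desc(struct gpio_device *gdev,
                                           unsigned int hwnum);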
2023-10-04gpiolib: provide gpio_device_find_by_label()Bartosz Golaszewski1-0/+1
By far the most common way of looking up GPIO devices is by their label. Provide a helper for that to avoid every user implementing their own matching function. Signed-off-by: Bartosz Golaszewski <[email protected]> Reviewed-by: Linus Walleij <[email protected]>
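A sketch of the helper and its intended use (the label string here is a made-up example):

    /* Returns a referenced gpio_device or NULL if no chip matches. */
    struct gpio_device *gpio_device_find_by_label(const char *label);

    struct gpio_device *gdev = gpio_device_find_by_label("pinctrl-bank0");
    if (gdev) {
            /* ... inspect the device ... */
            gpio_device_put(gdev);
    }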
2023-10-04gpiolib: provide gpio_device_find()Bartosz Golaszewski1-0/+3
gpiochip_find() is wrong and its kernel doc is misleading, as the function doesn't return a reference to the gpio_chip but just a raw pointer. The chip itself is not guaranteed to stay alive; in fact, it can be deleted at any point. Also: nobody other than GPIO drivers themselves has any business accessing gpio_chip structs. Provide a new gpio_device_find() function that returns a real reference to the opaque gpio_device structure, which is guaranteed to stay alive for as long as there are active users of it. Signed-off-by: Bartosz Golaszewski <[email protected]> Reviewed-by: Linus Walleij <[email protected]>
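The replacement presumably keeps gpiochip_find()'s match-callback shape but fixes the ownership semantics (sketch):

    /* match() is called for each registered chip; a non-zero return
     * selects it. The returned gpio_device carries a real reference
     * that the caller must drop with gpio_device_put(). */
    struct gpio_device *gpio_device_find(void *data,
                            int (*match)(struct gpio_chip *gc, void *data));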
2023-10-04gpiolib: add support for scope-based management to gpio_deviceBartosz Golaszewski1-0/+5
As the few users that need to get the reference to the GPIO device often release it right after inspecting its properties, let's add support for the automatic reference release to struct gpio_device. Signed-off-by: Bartosz Golaszewski <[email protected]> Reviewed-by: Linus Walleij <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]>
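A sketch of what scope-based management usually looks like with <linux/cleanup.h> (the DEFINE_FREE body and the label are assumptions):

    DEFINE_FREE(gpio_device_put, struct gpio_device *,
                if (!IS_ERR_OR_NULL(_T)) gpio_device_put(_T))

    /* Usage: the reference is dropped automatically at scope exit. */
    struct gpio_device *gdev __free(gpio_device_put) =
                    gpio_device_find_by_label("pinctrl-bank0");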
2023-10-04gpiolib: make gpio_device_get() and gpio_device_put() publicBartosz Golaszewski1-0/+3
In order to start migrating away from accessing struct gpio_chip by users other than their owners, let's first make the reference management functions for the opaque struct gpio_device public in the driver.h header. Signed-off-by: Bartosz Golaszewski <[email protected]> Reviewed-by: Linus Walleij <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]>
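The now-public helpers are presumably the plain refcount pair (sketch):

    struct gpio_device *gpio_device_get(struct gpio_device *gdev); /* +ref */
    void gpio_device_put(struct gpio_device *gdev);                /* -ref */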
2023-10-04OMAP/gpio: drop MPUIO static baseLinus Walleij1-3/+0
The OMAP GPIO driver hardcodes the MPUIO chip base, but there is no point: we have already moved all consumers over to using descriptor look-ups. Drop the MPUIO GPIO base and use dynamic assignment. Root out the unused instances of the OMAP_MPUIO() macro and delete the unused OMAP_GPIO_IS_MPUIO() macro. Signed-off-by: Linus Walleij <[email protected]> Reviewed-by: Tony Lindgren <[email protected]> Tested-by: Janusz Krzysztofik <[email protected]> Signed-off-by: Bartosz Golaszewski <[email protected]>
2023-10-04platform/x86/intel/tpmi: Add defines to get version informationSrinivas Pandruvada1-0/+6
Add defines to extract the major and minor version from a TPMI version field value. This avoids duplicating the conversion code in every feature driver. Also add a define for an invalid version field. Signed-off-by: Srinivas Pandruvada <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Ilpo Järvinen <[email protected]> Signed-off-by: Ilpo Järvinen <[email protected]>
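The defines plausibly look like this (a sketch; the exact bit layout of the version field is an assumption from the changelog):

    #include <linux/bitfield.h>

    #define TPMI_VERSION_INVALID     0xff
    #define TPMI_MINOR_VERSION(val)  FIELD_GET(GENMASK(4, 0), val)
    #define TPMI_MAJOR_VERSION(val)  FIELD_GET(GENMASK(7, 5), val)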
2023-10-04platform/chrome: cros_ec_proto: Mark outdata as constStephen Boyd1-1/+1
The 'outdata' is copied to the data buffer in cros_ec_cmd() before being sent over to the EC. Mark the argument as const so that callers can pass const pointers to this function and so that callers know the data won't be modified. Cc: Prashant Malani <[email protected]> Signed-off-by: Stephen Boyd <[email protected]> Acked-by: Prashant Malani <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Tzung-Bi Shih <[email protected]>
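The adjusted prototype presumably becomes the following, with only the const qualifier changing (sketch):

    int cros_ec_cmd(struct cros_ec_device *ec_dev, unsigned int version,
                    int command, const void *outdata, size_t outsize,
                    void *indata, size_t insize);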
2023-10-03mm: Remove unused vm_brk()Kees Cook1-2/+1
With fs/binfmt_elf.c fully refactored to use the new elf_load() helper, there are no more users of vm_brk(), so remove it. Cc: Andrew Morton <[email protected]> Cc: [email protected] Suggested-by: Eric Biederman <[email protected]> Tested-by: Pedro Falcato <[email protected]> Signed-off-by: Sebastian Ott <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>
2023-10-03net: add sysctl to disable rfc4862 5.5.3e lifetime handlingPatrick Rohr1-0/+1
This change adds a sysctl to opt out of RFC4862 section 5.5.3e's valid lifetime derivation mechanism. RFC4862 section 5.5.3e prescribes that the valid lifetime in a Router Advertisement PIO shall be ignored if it is less than 2 hours, and that the lifetime of the corresponding address shall be reset to 2 hours. An in-progress 6man draft (see draft-ietf-6man-slaac-renum-07 section 4.2) is currently looking to remove this mechanism. While this draft has not been moving particularly quickly for other reasons, there is widespread consensus on section 4.2, which updates RFC4862 section 5.5.3e. Cc: Maciej Żenczykowski <[email protected]> Cc: Lorenzo Colitti <[email protected]> Cc: Jen Linkova <[email protected]> Signed-off-by: Patrick Rohr <[email protected]> Reviewed-by: Jiri Pirko <[email protected]> Reviewed-by: David Ahern <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-10-03PNP: Clean up coding style in pnp.hGuoHua Cheng1-4/+4
Address the following checkpatch complaints:

  ERROR: "foo * bar" should be "foo *bar"
  ERROR: space required after that ';' (ctx:VxV)

Signed-off-by: GuoHua Cheng <[email protected]> [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <[email protected]>