2024-09-09  af_unix: Remove single nest in manage_oob().  (Kuniyuki Iwashima; 1 file, -22/+23)
This is a prep for the later fix. No functional change intended.

Signed-off-by: Kuniyuki Iwashima <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2024-09-09  Merge tag 'linux-can-next-for-6.12-20240909' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next  (Jakub Kicinski; 4 files, -24/+14)

Marc Kleine-Budde says:

====================
pull-request: can-next 2024-09-09

The first patch is by Rob Herring and simplifies the DT parsing in the cc770 driver.

The next 2 patches both target the rockchip_canfd driver added in the last pull request to net-next. The first one is by Nathan Chancellor and fixes the return type of rkcanfd_start_xmit(); the second one is by me and fixes a 64 bit division on 32 bit platforms in rkcanfd_timestamp_init().

* tag 'linux-can-next-for-6.12-20240909' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next:
  can: rockchip_canfd: rkcanfd_timestamp_init(): fix 64 bit division on 32 bit platforms
  can: rockchip_canfd: fix return type of rkcanfd_start_xmit()
  net: can: cc770: Simplify parsing DT properties
====================

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2024-09-09  net: remove dev_pick_tx_cpu_id()  (Jakub Kicinski; 2 files, -9/+0)
dev_pick_tx_cpu_id() was introduced with two users by commit a4ea8a3dacc3 ("net: Add generic ndo_select_queue functions"). The use in AF_PACKET was removed in 2019 by commit b71b5837f871 ("packet: rework packet_pick_tx_queue() to use common code selection"). The other user was a Netlogic XLP driver, removed in 2021 by commit 47ac6f567c28 ("staging: Remove Netlogic XLP network driver"). It's relatively unlikely that any modern driver will need an .ndo_select_queue implementation which picks purely based on CPU ID and skips XPS, so delete dev_pick_tx_cpu_id().

Found by code inspection.

Reviewed-by: Eric Dumazet <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2024-09-09  Merge branch 'selftests-mptcp-add-time-per-subtests-in-tap-output'  (Jakub Kicinski; 8 files, -10/+34)
Matthieu Baerts says:

====================
selftests: mptcp: add time per subtests in TAP output

Patches here add 'time=<N>ms' in the diagnostic data of the TAP output, e.g.

  ok 1 - pm_netlink: defaults addr list # time=9ms

This addition is useful to quickly identify which subtests are taking a longer time than the others, or more than expected. Note that there are no specific formats to follow to show this time according to the TAP 13, TAP 14 and KTAP specifications, but we follow the format being parsed by NIPA [1].

Patch 1 modifies mptcp_lib.sh to add this support to all MPTCP selftests.
Patch 2 removes the now duplicated info in mptcp_connect.sh.
Patch 3 slightly improves the precision of the first subtests in all MPTCP subtests.
Patches 4 and 5 remove duplicated spaces in TAP output, for the TAP parsers that cannot handle them properly.

v1: https://lore.kernel.org/20240902-net-next-mptcp-ksft-subtest-time-v1-0-f1ed499a11b1@kernel.org
Link: https://github.com/linux-netdev/nipa/pull/36 [1]
====================

Link: https://patch.msgid.link/20240906-net-next-mptcp-ksft-subtest-time-v2-0-31d5ee4f3bdf@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>
2024-09-09  selftests: mptcp: connect: remove duplicated spaces in TAP output  (Matthieu Baerts (NGI0); 1 file, -6/+8)
It is nice to have a visual alignment in the test output to present the different results, but it makes less sense in the TAP output, which is there for computers. It then sounds better to remove the duplicated whitespaces in the TAP output, also because they can cause some issues with TAP parsers expecting only one space around the directive delimiter (#).

While at it, change the variable name (result_msg) to something more explicit.

Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://patch.msgid.link/20240906-net-next-mptcp-ksft-subtest-time-v2-5-31d5ee4f3bdf@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>
2024-09-09  selftests: mptcp: diag: remove trailing whitespace  (Matthieu Baerts (NGI0); 1 file, -1/+1)
It doesn't need to be there, and it can cause some issues with TAP parsers expecting only one space around the directive delimiter (#).

Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://patch.msgid.link/20240906-net-next-mptcp-ksft-subtest-time-v2-4-31d5ee4f3bdf@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>
2024-09-09  selftests: mptcp: reset the last TS before the first test  (Matthieu Baerts (NGI0); 6 files, -1/+9)
Just to slightly improve the precision of the duration of the first test.

In mptcp_join.sh, the last append_prev_results is now done as soon as the last test is over: this will add the last result in the list, and get a more precise time for this last test.

Reviewed-by: Mat Martineau <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://patch.msgid.link/20240906-net-next-mptcp-ksft-subtest-time-v2-3-31d5ee4f3bdf@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>
2024-09-09  selftests: mptcp: connect: remove time in TAP output  (Matthieu Baerts (NGI0); 1 file, -1/+0)
It is now added by the MPTCP lib automatically, see the parent commit. The time in the TAP output might be slightly different from the one displayed before, but that's OK.

Reviewed-by: Mat Martineau <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://patch.msgid.link/20240906-net-next-mptcp-ksft-subtest-time-v2-2-31d5ee4f3bdf@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>
2024-09-09  selftests: mptcp: lib: add time per subtests in TAP output  (Matthieu Baerts (NGI0); 1 file, -1/+16)
It adds 'time=<N>ms' in the diagnostic data of the TAP output, e.g.

  ok 1 - pm_netlink: defaults addr list # time=9ms

This addition is useful to quickly identify which subtests are taking a longer time than the others, or more than expected.

Note that there are no specific formats to follow to show this time according to the TAP 13 [1], TAP 14 [2] and KTAP [3] specifications. Let's then define this one here.

Link: https://testanything.org/tap-version-13-specification.html [1]
Link: https://testanything.org/tap-version-14-specification.html [2]
Link: https://docs.kernel.org/dev-tools/ktap.html [3]
Reviewed-by: Mat Martineau <[email protected]>
Signed-off-by: Matthieu Baerts (NGI0) <[email protected]>
Link: https://patch.msgid.link/20240906-net-next-mptcp-ksft-subtest-time-v2-1-31d5ee4f3bdf@kernel.org
Signed-off-by: Jakub Kicinski <[email protected]>
2024-09-09  treewide: correct the typo 'retun'  (WangYuli; 7 files, -7/+7)
There are some spelling mistakes of 'retun' in comments, which should instead be 'return'.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: WangYuli <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-09  ocfs2: cleanup return value and mlog in ocfs2_global_read_info()  (Joseph Qi; 1 file, -6/+9)
Return 0 instead of the sizeof(ocfs2_global_disk_dqinfo) that .quota_read returns in the normal case. Also clean up mlog to make the code more readable.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Joseph Qi <[email protected]>
Reviewed-by: Heming Zhao <[email protected]>
Cc: Mark Fasheh <[email protected]>
Cc: Joel Becker <[email protected]>
Cc: Junxiao Bi <[email protected]>
Cc: Changwei Ge <[email protected]>
Cc: Gang He <[email protected]>
Cc: Jun Piao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-09  nilfs2: remove duplicate 'unlikely()' usage  (Kunwu Chan; 1 file, -1/+1)
Remove a nested unlikely() call; IS_ERR() already uses unlikely() internally.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kunwu Chan <[email protected]>
Signed-off-by: Ryusuke Konishi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
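For illustration, a minimal sketch of the redundant pattern (a hypothetical call site, not the actual nilfs2 code):

    #include <linux/err.h>

    /* IS_ERR() already expands to unlikely(IS_ERR_VALUE(...)), so
     * wrapping it in another unlikely() adds nothing. */
    static int check_ptr(void *p)
    {
            if (IS_ERR(p))          /* was: if (unlikely(IS_ERR(p))) */
                    return PTR_ERR(p);
            return 0;
    }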
2024-09-09  nilfs2: fix potential oob read in nilfs_btree_check_delete()  (Ryusuke Konishi; 1 file, -2/+5)
The function nilfs_btree_check_delete(), which checks whether degeneration to direct mapping occurs before deleting a b-tree entry, causes memory access outside the block buffer when retrieving the maximum key if the root node has no entries.

This does not usually happen because b-tree mappings with 0 child nodes are never created by mkfs.nilfs2 or nilfs2 itself. However, it can happen if the b-tree root node read from a device is configured that way, so fix this potential issue by adding a check for that case.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 17c76b0104e4 ("nilfs2: B-tree based block mapping")
Signed-off-by: Ryusuke Konishi <[email protected]>
Cc: Lizhi Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-09  nilfs2: determine empty node blocks as corrupted  (Ryusuke Konishi; 1 file, -1/+1)
Due to the nature of b-trees, nilfs2 itself and admin tools such as mkfs.nilfs2 will never create an intermediate b-tree node block with 0 child nodes, nor will they delete (key, pointer)-entries that would result in such a state. However, it is possible that a b-tree node block is corrupted on the backing device and is read with 0 child nodes.

Because operation is not guaranteed if the number of child nodes is 0 for intermediate node blocks other than the root node, modify nilfs_btree_node_broken(), which performs sanity checks when reading a b-tree node block, so that such cases will be judged as metadata corruption.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 17c76b0104e4 ("nilfs2: B-tree based block mapping")
Signed-off-by: Ryusuke Konishi <[email protected]>
Cc: Lizhi Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-09  nilfs2: fix potential null-ptr-deref in nilfs_btree_insert()  (Ryusuke Konishi; 1 file, -1/+2)
Patch series "nilfs2: fix potential issues with empty b-tree nodes". This series addresses three potential issues with empty b-tree nodes that can occur with corrupted filesystem images, including one recently discovered by syzbot. This patch (of 3): If a b-tree is broken on the device, and the b-tree height is greater than 2 (the level of the root node is greater than 1) even if the number of child nodes of the b-tree root is 0, a NULL pointer dereference occurs in nilfs_btree_prepare_insert(), which is called from nilfs_btree_insert(). This is because, when the number of child nodes of the b-tree root is 0, nilfs_btree_do_lookup() does not set the block buffer head in any of path[x].bp_bh, leaving it as the initial value of NULL, but if the level of the b-tree root node is greater than 1, nilfs_btree_get_nonroot_node(), which accesses the buffer memory of path[x].bp_bh, is called. Fix this issue by adding a check to nilfs_btree_root_broken(), which performs sanity checks when reading the root node from the device, to detect this inconsistency. Thanks to Lizhi Xu for trying to solve the bug and clarifying the cause early on. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 17c76b0104e4 ("nilfs2: B-tree based block mapping") Signed-off-by: Ryusuke Konishi <[email protected]> Reported-by: [email protected] Closes: https://syzkaller.appspot.com/bug?extid=9bff4c7b992038a7409f Cc: Lizhi Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-09  user_namespace: use kmemdup_array() instead of kmemdup() for multiple allocation  (Jinjie Ruan; 1 file, -3/+2)
Let kmemdup_array() take care of the multiplication and possible overflows.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Jinjie Ruan <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Li zeming <[email protected]>
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
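As a hedged illustration of the conversion (hypothetical variable names, and the kcalloc-style count/size argument order of kmemdup_array() is assumed here; check the string.h prototype for the tree in question):

    #include <linux/slab.h>
    #include <linux/string.h>

    /* Before: the caller multiplies count by element size itself,
     * which can overflow silently. */
    dup = kmemdup(src, nr_extents * sizeof(*src), GFP_KERNEL);

    /* After: kmemdup_array() performs the overflow-checked
     * multiplication internally. */
    dup = kmemdup_array(src, nr_extents, sizeof(*src), GFP_KERNEL);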
2024-09-09  tools/mm: rm thp_swap_allocator_test when make clean  (zhangjiao; 1 file, -1/+1)
Remove the thp_swap_allocator_test binary when running "make clean".

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: zhangjiao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-09  squashfs: fix percpu address space issues in decompressor_multi_percpu.c  (Uros Bizjak; 1 file, -3/+3)
When strict percpu address space checks are enabled, the current direct casts between the percpu address space and the generic address space fail the compilation on x86_64 with:

  decompressor_multi_percpu.c: In function `squashfs_decompressor_create':
  decompressor_multi_percpu.c:49:16: error: cast to generic address space pointer from disjoint `__seg_gs' address space pointer
  decompressor_multi_percpu.c: In function `squashfs_decompressor_destroy':
  decompressor_multi_percpu.c:64:25: error: cast to `__seg_gs' address space pointer from disjoint generic address space pointer
  decompressor_multi_percpu.c: In function `squashfs_decompress':
  decompressor_multi_percpu.c:82:25: error: cast to `__seg_gs' address space pointer from disjoint generic address space pointer

Add intermediate casts to unsigned long, as advised in [1] and [2].

Side note: sparse still requires __force when casting from the percpu address space, although the documentation [2] allows casts to unsigned long without the __force attribute.

Found by GCC's named address space checks. There were no changes in the resulting object file.

[1] https://gcc.gnu.org/onlinedocs/gcc/Named-Address-Spaces.html#x86-Named-Address-Spaces
[2] https://sparse.docs.kernel.org/en/latest/annotations.html#address-space-name

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uros Bizjak <[email protected]>
Cc: Phillip Lougher <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
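A hedged sketch of the fix pattern (the function and variable names are hypothetical, not the actual squashfs code):

    struct squashfs_stream;

    static struct squashfs_stream *
    to_generic(struct squashfs_stream __percpu *percpu)
    {
            /* A direct cast, (struct squashfs_stream *)percpu, is what
             * the compiler rejects under strict address space checks;
             * the intermediate cast through unsigned long is the form
             * the GCC named-address-space documentation advises. */
            return (struct squashfs_stream *)(unsigned long)percpu;
    }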
2024-09-09  lib: glob.c: added null check for character class  (Alok Swaminathan; 1 file, -0/+2)
Add a null check for the character class. Previously, an inverted character class could result in a nul byte being matched and lead to the function reading past the end of the input string.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Alok Swaminathan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
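A hedged, simplified sketch of the guard, modeled on the character-class loop in lib/glob.c (names and structure simplified, not the verbatim patch):

    #include <stdbool.h>

    /* The "a == '\0'" guard is the essence of the fix: without it, an
     * unterminated class such as "[!x" scans past the end of the
     * pattern, and an inverted class can "match" the terminating NUL. */
    static bool match_class(const char *pat, char c, const char **rest)
    {
            bool match = false, inverted = (*pat == '!');
            const char *class = pat + inverted;
            unsigned char a = *class++;

            do {
                    unsigned char b = a;

                    if (a == '\0')          /* malformed class */
                            return false;

                    if (class[0] == '-' && class[1] != ']') {
                            b = class[1];
                            class += 2;
                    }
                    if (a <= (unsigned char)c && (unsigned char)c <= b)
                            match = true;
            } while ((a = *class++) != ']');

            *rest = class;
            return match ^ inverted;
    }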
2024-09-09  scx_qmap: Implement highpri boosting  (Tejun Heo; 2 files, -14/+130)
Implement a silly boosting mechanism for nice -20 tasks. The only purpose is demonstrating and testing scx_bpf_dispatch_from_dsq(). The boosting only works within SHARED_DSQ and makes only minor differences with increased dispatch batch (-b).

This exercises moving tasks to a user DSQ and all local DSQs from ops.dispatch() and a BPF timerfn.

v2:
- Updated to use scx_bpf_dispatch_from_dsq_set_{slice|vtime}().
- Drop the workaround for the iterated tasks not being trusted by the verifier. The issue is fixed from the BPF side.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Daniel Hodges <[email protected]>
Cc: David Vernet <[email protected]>
Cc: Changwoo Min <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Dan Schatzberg <[email protected]>
2024-09-09  sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()  (Tejun Heo; 2 files, -3/+239)
Once a task is put into a DSQ, the allowed operations are fairly limited. Tasks in the built-in local and global DSQs are executed automatically and, ignoring dequeue, there is only one way a task in a user DSQ can be manipulated - scx_bpf_consume() moves the first task to the dispatching local DSQ. This inflexibility sometimes gets in the way and is an area where multiple feature requests have been made.

Implement scx_bpf_dispatch[_vtime]_from_dsq(), which can be called during DSQ iteration and can move the task to any DSQ - local DSQs, global DSQ and user DSQs. The kfuncs can be called from ops.dispatch() and any BPF context which doesn't hold a rq lock, including BPF timers and SYSCALL programs.

This is an expansion of an earlier patch which only allowed moving into the dispatching local DSQ:

  http://lkml.kernel.org/r/[email protected]

v2: Remove @slice and @vtime from scx_bpf_dispatch_from_dsq[_vtime]() as they push scx_bpf_dispatch_from_dsq_vtime() over the kfunc argument count limit and often won't be needed anyway. Instead provide scx_bpf_dispatch_from_dsq_set_{slice|vtime}() kfuncs which can be called only when needed and override the specified parameter for the subsequent dispatch.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Daniel Hodges <[email protected]>
Cc: David Vernet <[email protected]>
Cc: Changwoo Min <[email protected]>
Cc: Andrea Righi <[email protected]>
Cc: Dan Schatzberg <[email protected]>
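A hedged sketch of how a BPF scheduler could use the new kfuncs during DSQ iteration (MY_USER_DSQ is a placeholder, and the iteration macros follow common sched_ext BPF conventions; treat the details as an assumption rather than the patch's verbatim usage):

    /* Walk a user DSQ and move the first matching task straight into
     * the local DSQ of the dispatching CPU. */
    struct task_struct *p;

    bpf_for_each(scx_dsq, p, MY_USER_DSQ, 0) {
            /* Optionally override the slice for the upcoming move. */
            scx_bpf_dispatch_from_dsq_set_slice(BPF_FOR_EACH_ITER,
                                                5 * 1000 * 1000);
            if (scx_bpf_dispatch_from_dsq(BPF_FOR_EACH_ITER, p,
                                          SCX_DSQ_LOCAL, 0))
                    break;
    }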
2024-09-09  sched_ext: Compact struct bpf_iter_scx_dsq_kern  (Tejun Heo; 2 files, -12/+21)
struct bpf_iter_scx_dsq is defined as 6 u64's and bpf_iter_scx_dsq_kern was using 5 of them. We want to add two more u64 fields but it's better if we do so while staying within bpf_iter_scx_dsq to maintain binary compatibility.

The way bpf_iter_scx_dsq_kern is laid out is rather inefficient - the node field takes up three u64's but only one bit of the last u64 is used. Turn the bool into u32 flags and only use the lower 16 bits, freeing up 48 bits - 16 bits for flags, 32 bits for a u32 - for use by struct bpf_iter_scx_dsq_kern.

This allows moving the dsq_seq and flags fields of bpf_iter_scx_dsq_kern into the cursor field, reducing the struct size by a full u64.

No behavior changes intended.

Signed-off-by: Tejun Heo <[email protected]>
2024-09-09  sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()  (Tejun Heo; 1 file, -16/+26)
- Rename move_task_to_local_dsq() to move_remote_task_to_local_dsq().

- Rename consume_local_task() to move_local_task_to_local_dsq() and remove task_unlink_from_dsq() and source DSQ unlocking from it.

This is to make the migration code easier to reuse.

No functional changes intended.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: David Vernet <[email protected]>
2024-09-09  sched_ext: Move consume_local_task() upward  (Tejun Heo; 1 file, -17/+14)
So that the local case comes first and two CONFIG_SMP blocks can be merged.

No functional changes intended.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: David Vernet <[email protected]>
2024-09-09  sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq()  (Tejun Heo; 1 file, -4/+3)
All task_unlink_from_dsq() users are doing dsq_mod_nr(dsq, -1). Move it into task_unlink_from_dsq(). Also move the sanity check into it.

No functional changes intended.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: David Vernet <[email protected]>
2024-09-09  sched_ext: Reorder args for consume_local/remote_task()  (Tejun Heo; 1 file, -7/+7)
Reorder args for consistency in the order of: current_rq, p, src_[rq|dsq], dst_[rq|dsq].

No functional changes intended.

Signed-off-by: Tejun Heo <[email protected]>
2024-09-09  sched_ext: Restructure dispatch_to_local_dsq()  (Tejun Heo; 1 file, -50/+46)
Now that there's nothing left after the big if block, flip the if condition and unindent the body.

No functional changes intended.

v2: Add BUG() to clarify that control can't reach the end of dispatch_to_local_dsq() in UP kernels, per David.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: David Vernet <[email protected]>
2024-09-09  sched_ext: Fix process_ddsp_deferred_locals() by unifying DTL_INVALID handling  (Tejun Heo; 1 file, -31/+10)
With the preceding update, the only return value which makes a meaningful difference is DTL_INVALID, for which one caller, finish_dispatch(), falls back to the global DSQ while the other, process_ddsp_deferred_locals(), doesn't do anything. It should always fall back to the global DSQ. Move the global DSQ fallback into dispatch_to_local_dsq() and remove the return value.

v2: Patch title and description updated to reflect the behavior fix for process_ddsp_deferred_locals().

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: David Vernet <[email protected]>
2024-09-09  sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON  (Tejun Heo; 1 file, -50/+40)
find_dsq_for_dispatch() handles all DSQ IDs except SCX_DSQ_LOCAL_ON; instead, each caller is handling SCX_DSQ_LOCAL_ON before calling it. Move the SCX_DSQ_LOCAL_ON lookup into find_dsq_for_dispatch() to remove duplicate code in direct_dispatch() and dispatch_to_local_dsq().

No functional changes intended.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: David Vernet <[email protected]>
2024-09-09  sched_ext: Refactor consume_remote_task()  (Tejun Heo; 1 file, -69/+76)
The tricky p->scx.holding_cpu handling was split across the consume_remote_task() body and move_task_to_local_dsq(). Refactor such that:

- All the tricky part is now in the new unlink_dsq_and_lock_src_rq() with consolidated documentation.

- move_task_to_local_dsq() now implements straightforward task migration, making it easier to use in other places.

- dispatch_to_local_dsq() is another user of move_task_to_local_dsq(). The usage is updated accordingly. This makes the local and remote cases more symmetric.

No functional changes intended.

v2: s/task_rq/src_rq/ for consistency.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: David Vernet <[email protected]>
2024-09-09  sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate  (Tejun Heo; 1 file, -33/+33)
Sleepables don't need to be in their own kfunc set as each is tagged with KF_SLEEPABLE. Rename the set to scx_kfunc_set_unlocked, indicating that the rq lock is not held, and relocate it right above the "any" set. This will be used to add kfuncs that are allowed to be called from SYSCALL but not TRACING.

No functional changes intended.

Signed-off-by: Tejun Heo <[email protected]>
Acked-by: David Vernet <[email protected]>
2024-09-09  selftests: return failure when timestamps can't be reported  (Jason Xing; 1 file, -1/+5)
When I was trying to modify the tx timestamping feature, I found that running "./txtimestamp -4 -C -L 127.0.0.1" didn't reflect the error: I succeeded in generating the timestamp stored in the skb but later failed to report it to userspace (which means it failed to put the timestamp data into the cmsg). This can happen when someone writes buggy code in __sock_recv_timestamp(), for example.

Add a check so that running ./txtimestamp reflects the result correctly if there is a bug in the reporting phase, like this:

  protocol: TCP
  payload: 10
  server port: 9000
  family: INET
  test SND
  USR: 1725458477 s 667997 us (seq=0, len=0)
  Failed to report timestamps
  USR: 1725458477 s 718128 us (seq=0, len=0)
  Failed to report timestamps
  USR: 1725458477 s 768273 us (seq=0, len=0)
  Failed to report timestamps
  USR: 1725458477 s 818416 us (seq=0, len=0)
  Failed to report timestamps
  ...

In the future, it will help us detect whether a newly submitted patch has bugs or not.

Signed-off-by: Jason Xing <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
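A hedged sketch of the kind of check involved (simplified; the selftest's actual structure and names differ): after draining the error queue, a missing SCM_TIMESTAMPING control message is treated as a failure rather than silently ignored.

    #include <stdbool.h>
    #include <sys/socket.h>

    /* Return true iff a timestamp cmsg was actually reported. */
    static bool has_timestamp(struct msghdr *msg)
    {
            struct cmsghdr *cm;

            for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm))
                    if (cm->cmsg_level == SOL_SOCKET &&
                        cm->cmsg_type == SCM_TIMESTAMPING)
                            return true;
            return false;
    }

On the caller side (also a sketch), a failed check would print "Failed to report timestamps" and latch a flag that later drives a non-zero exit status.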
2024-09-09  mm/codetag: add pgalloc_tag_copy()  (Yu Zhao; 3 files, -14/+38)
Add pgalloc_tag_copy() to transfer the codetag from the old folio to the new one during migration. This makes the original allocation sites persist across migration rather than lumping them into the get_new_folio callbacks passed into migrate_pages(), e.g., compaction_alloc():

  # echo 1 >/proc/sys/vm/compact_memory
  # grep compaction_alloc /proc/allocinfo

Before this patch:

  132968448    32463 mm/compaction.c:1880 func:compaction_alloc

After this patch:

  0    0 mm/compaction.c:1880 func:compaction_alloc

Link: https://lkml.kernel.org/r/[email protected]
Fixes: dcfe378c81f7 ("lib: introduce support for page allocation tagging")
Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Suren Baghdasaryan <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-09  mm/codetag: fix pgalloc_tag_split()  (Yu Zhao; 5 files, -35/+34)
The current assumption is that a large folio can only be split into order-0 folios. That is not the case for hugeTLB demotion, nor for THP split: see commit c010d47f107f ("mm: thp: split huge page to any lower order pages").

When a large folio is split into ones of a lower non-zero order, only the new head pages should be tagged. Tagging tail pages can cause imbalanced "calls" counters, since only head pages are untagged by pgalloc_tag_sub() and the "calls" counts on tail pages are leaked, e.g.,

  # echo 2048kB >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote_size
  # echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
  # time echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote
  # echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
  # grep alloc_gigantic_folio /proc/allocinfo

Before this patch:

  0    549427200 mm/hugetlb.c:1549 func:alloc_gigantic_folio

  real  0m2.057s
  user  0m0.000s
  sys   0m2.051s

After this patch:

  0    0 mm/hugetlb.c:1549 func:alloc_gigantic_folio

  real  0m1.711s
  user  0m0.000s
  sys   0m1.704s

Not tagging tail pages also improves the splitting time, e.g., by about 15% when demoting 1GB hugeTLB folios to 2MB ones, as shown above.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: be25d1d4e822 ("mm: create new codetag references during page splitting")
Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Suren Baghdasaryan <[email protected]>
Cc: Kent Overstreet <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-09  mm/codetag: fix a typo  (Yu Zhao; 1 file, -1/+1)
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 22d407b164ff ("lib: add allocation tagging support for memory allocation profiling")
Signed-off-by: Yu Zhao <[email protected]>
Acked-by: Suren Baghdasaryan <[email protected]>
Acked-by: Muchun Song <[email protected]>
Cc: Kent Overstreet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-09mm/vmalloc.c: use "high-order" in description non 0-order pagesUladzislau Rezki (Sony)1-2/+2
In many places in the comments, we use both "higher-order" and "high-order" to describe non-0-order pages. That is confusing, because a "higher-order" statement does not say what it is being compared with.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Suggested-by: Baoquan He <[email protected]>
Reviewed-by: Baoquan He <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Oleksiy Avramchenko <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-09  mm/vmalloc.c: use helper function va_size()  (ZhangPeng; 1 file, -9/+8)
Use the helper function va_size() to improve code readability. No functional modification involved.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: ZhangPeng <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
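For reference, va_size() is a one-line helper in mm/vmalloc.c, so the change is purely about readability at the call sites (the converted call sites below are illustrative, not the exact diff):

    /* Helper, as defined in mm/vmalloc.c: */
    static __always_inline unsigned long va_size(struct vmap_area *va)
    {
            return (va->va_end - va->va_start);
    }

    /* Call sites go from open-coded pointer arithmetic ... */
    nr_pages = (va->va_end - va->va_start) >> PAGE_SHIFT;
    /* ... to the named helper: */
    nr_pages = va_size(va) >> PAGE_SHIFT;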
2024-09-09  mm: replace xa_get_order with xas_get_order where appropriate  (Shakeel Butt; 2 files, -4/+4)
Tracing of invalidation and truncation operations on large files showed that xa_get_order() is among the top functions in which the kernel spends a lot of CPU time. xa_get_order() needs to traverse the tree to reach the right node for a given index and then extract the order of the entry. However, at many places it is being called within an already ongoing tree traversal, where there is no need to do another traversal. Just use xas_get_order() at those places.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Shakeel Butt <[email protected]>
Reviewed-by: Liam R. Howlett <[email protected]>
Cc: Baolin Wang <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Nhat Pham <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
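A hedged, schematic example of the substitution inside an existing XArray walk (the mapping/index names are placeholders, not the converted call sites):

    XA_STATE(xas, &mapping->i_pages, start);
    void *entry;

    rcu_read_lock();
    xas_for_each(&xas, entry, end) {
            /* Before: xa_get_order() re-walks the tree from the root:
             *     order = xa_get_order(xas.xa, xas.xa_index);
             * After: read the order off the node the iteration is
             * already positioned on: */
            int order = xas_get_order(&xas);
            /* ... use order ... */
    }
    rcu_read_unlock();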
2024-09-09  maple_tree: mark three functions as __maybe_unused  (Liam R. Howlett; 1 file, -3/+3)
People keep trying to remove three functions that are going to be used in a feature that is being developed. Dropping the functions entirely may end up with people trying to use the bit for other uses, as people have tried in the past.

Adding __maybe_unused stops compilers complaining about the unused functions so they can be silently optimised out of the compiled code and people won't try to claim the bit for another use.

Link: https://lore.kernel.org/all/[email protected]/
Link: https://lore.kernel.org/all/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Reviewed-by: Kuan-Wei Chiu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
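The annotation pattern, for reference (a generic sketch with a hypothetical helper, not the maple-tree functions themselves):

    #include <linux/compiler.h>

    /* __maybe_unused suppresses -Wunused-function while still letting
     * the compiler drop the body from the object code until a caller
     * is added. */
    static __maybe_unused int future_feature_helper(int x)
    {
            return x * 2;   /* placeholder body */
    }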
2024-09-09  mm: clean up mem_cgroup_iter()  (Kinsey Ho; 1 file, -20/+12)
A cleanup to make variable names clearer and to improve code readability. No functional change.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kinsey Ho <[email protected]>
Reviewed-by: T.J. Mercier <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-09  mm: restart if multiple traversals raced  (Kinsey Ho; 2 files, -11/+19)
Currently, if multiple reclaimers race on the same position, the reclaimers which detect the race still reclaim from the same memcg. Instead, the reclaimers which detect the race should move on to the next memcg in the hierarchy.

So, in the case where multiple traversals race, jump back to the start of the mem_cgroup_iter() function to find the next memcg in the hierarchy to reclaim from.

Link: https://lkml.kernel.org/r/[email protected]
Reported-by: [email protected]
Closes: https://lore.kernel.org/[email protected]/
Signed-off-by: Kinsey Ho <[email protected]>
Reviewed-by: T.J. Mercier <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-09  mm: increment gen # before restarting traversal  (Kinsey Ho; 1 file, -10/+12)
The generation number in struct mem_cgroup_reclaim_iter should be incremented on every round-trip. Currently, it is possible for a concurrent reclaimer to jump in at the end of the hierarchy, causing a traversal restart (resetting the iteration position) without incrementing the generation number.

By resetting the position without incrementing the generation, it's possible for another ongoing mem_cgroup_iter() thread to walk the tree twice.

Move the traversal restart such that the generation number is incremented before the restart.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kinsey Ho <[email protected]>
Reviewed-by: T.J. Mercier <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-09  mm: don't hold css->refcnt during traversal  (Kinsey Ho; 1 file, -17/+1)
To obtain the pointer to the next memcg position, mem_cgroup_iter() currently holds css->refcnt during memcg traversal only to put css->refcnt at the end of the routine. This isn't necessary as an rcu_read_lock is already held throughout the function. The use of the RCU read lock with css_next_descendant_pre() guarantees that sibling linkage is safe without holding a ref on the passed-in @css.

Remove css->refcnt usage during traversal by leveraging RCU.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kinsey Ho <[email protected]>
Reviewed-by: T.J. Mercier <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Michal Koutný <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Yosry Ahmed <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-09  cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU  (Kinsey Ho; 2 files, -8/+14)
Patch series "Improve mem_cgroup_iter()", v4. Incremental cgroup iteration is being used again [1]. This patchset improves the reliability of mem_cgroup_iter(). It also improves simplicity and code readability. [1] https://lore.kernel.org/[email protected]/ This patch (of 5): Explicitly document that css sibling/descendant linkage is protected by cgroup_mutex or RCU. Also, document in css_next_descendant_pre() and similar functions that it isn't necessary to hold a ref on @pos. The following changes in this patchset rely on this clarification for simplification in memcg iteration code. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Suggested-by: Yosry Ahmed <[email protected]> Reviewed-by: Michal Koutný <[email protected]> Signed-off-by: Kinsey Ho <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Zefan Li <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: T.J. Mercier <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-09-09  mm/page_alloc: fix build with CONFIG_UNACCEPTED_MEMORY=n  (Andrew Morton; 1 file, -11/+5)
When has_unaccepted_memory() is unused, it prevents kernel builds with clang, `make W=1` and CONFIG_WERROR=y:

  mm/page_alloc.c:7036:20: error: unused function 'has_unaccepted_memory' [-Werror,-Wunused-function]
   7036 | static inline bool has_unaccepted_memory(void)
        |                    ^~~~~~~~~~~~~~~~~~~~~

Fix it by removing the CONFIG_UNACCEPTED_MEMORY=n stub.

Link: https://lkml.kernel.org/r/[email protected]
Reported-by: Andy Shevchenko <[email protected]>
Closes: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Andrew Morton <[email protected]>
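Schematically, the warning comes from an ifdef'd stub pattern like the following (a simplified sketch; the body of the =y variant is an assumption about mm/page_alloc.c rather than a quote of it):

    #ifdef CONFIG_UNACCEPTED_MEMORY
    static bool has_unaccepted_memory(void)
    {
            return static_branch_unlikely(&zones_with_unaccepted_pages);
    }
    #else
    /* With CONFIG_UNACCEPTED_MEMORY=n this stub ended up with no
     * callers, so clang's -Wunused-function fired under W=1; the fix
     * removes the stub along with the =n-only code that referenced it. */
    static inline bool has_unaccepted_memory(void)
    {
            return false;
    }
    #endif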
2024-09-09  mm: migrate: remove unused includes  (Kefeng Wang; 1 file, -7/+0)
random.h is not needed since commit 6c542ab75714 ("mm/demotion: build demotion targets based on explicit memory tiers"); all functions moved into memory-tiers.

nsproxy.h is not needed since commit 228ebcbe634a ("Uninline find_task_by_xxx set of functions"); there is no nsproxy, and we only call find_task_by_vpid() now.

hugetlb_cgroup.h is not needed since commit ab5ac90aecf5 ("mm, hugetlb: do not rely on overcommit limit during migration"); move_hugetlb_state() is called and it belongs to hugetlb.h, which is already included.

balloon_compaction.h: we have the more general movable_operations for non-lru movable page migration, so it can be dropped.

memremap.h, userfaultfd_k.h and oom.h were introduced for zone device page migration, but all functions have been moved into migrate_device.c, so they are not needed anymore either.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-09  mm: thp: simplify split_huge_pages_pid()  (Nanyong Sun; 1 file, -6/+1)
The helper find_get_task_by_vpid() can completely replace the task_struct find logic in split_huge_pages_pid(), so use it to simplify the code. Also delete the needless comments, as the helper function name already explains what it is doing.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nanyong Sun <[email protected]>
Cc: Kefeng Wang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-09  mm: migrate: simplify find_mm_struct()  (Nanyong Sun; 1 file, -7/+1)
Use find_get_task_by_vpid() to replace the task_struct find logic in find_mm_struct(). Note that this patch moves the ptrace_may_access() call out of the rcu_read_lock() scope; this is OK because it does not actually need it: find_get_task_by_vpid() already gets the pid and task safely, and ptrace_may_access() can then use the task safely, like what sched_core_share_pid() similarly does.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Nanyong Sun <[email protected]>
Cc: Kefeng Wang <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
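A hedged sketch of the simplification shared by this and the previous patch (schematic, not the exact diff):

    /* Before: open-coded pid -> task lookup with manual RCU and
     * refcounting. */
    rcu_read_lock();
    task = find_task_by_vpid(pid);
    if (task)
            get_task_struct(task);
    rcu_read_unlock();

    /* After: the helper from kernel/pid.c does the RCU lookup and
     * takes the reference in one call. */
    task = find_get_task_by_vpid(pid);
    if (!task)
            return ERR_PTR(-ESRCH);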
2024-09-09  mm/damon/tests/core-kunit: skip damon_test_nr_accesses_to_accesses_bp() if aggr_interval is zero  (SeongJae Park; 2 files, -1/+19)

The aggregation interval of the test-purpose damon_attrs for damon_test_nr_accesses_to_accesses_bp() becomes zero on 32-bit architectures, since the sizes of the int and long types are the same there. As a result, the damon_nr_accesses_to_accesses_bp() call with the test data triggers a divide-by-zero exception. damon_nr_accesses_to_accesses_bp() shouldn't be called with such data, and the non-test code avoids that by checking for the case in damon_update_monitoring_results(). Skip the test code in that case, and add an explicit caution about the case to the comment for the test target function.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: 5e06ad590096 ("mm/damon/core-test: test max_nr_accesses overflow caused divide-by-zero")
Signed-off-by: SeongJae Park <[email protected]>
Reported-by: Guenter Roeck <[email protected]>
Closes: https://lore.kernel.org/[email protected]
Cc: Brendan Higgins <[email protected]>
Cc: David Gow <[email protected]>
Cc: Guenter Roeck <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
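A hedged sketch of the guard (the attrs values are illustrative; kunit_skip() marks the test as skipped rather than failed):

    static void damon_test_nr_accesses_to_accesses_bp(struct kunit *test)
    {
            struct damon_attrs attrs = {
                    .sample_interval = 10,
                    .aggr_interval = ((unsigned long)UINT_MAX + 1) * 10,
            };

            /* On 32-bit, the initializer above truncates to 0 and the
             * conversion helper would divide by zero, so skip. */
            if (attrs.aggr_interval == 0)
                    kunit_skip(test, "aggr_interval is zero.");

            /* ... the real test assertions follow here ... */
    }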
2024-09-09  uprobes: use vm_special_mapping close() functionality  (Sven Schnelle; 3 files, -21/+17)
The following KASAN splat was shown:

  [   44.505448] ==================================================================
  [   44.505455] BUG: KASAN: slab-use-after-free in special_mapping_close+0x9c/0xc8
  [   44.505471] Read of size 8 at addr 00000000868dac48 by task sh/1384
  [   44.505479]
  [   44.505486] CPU: 51 UID: 0 PID: 1384 Comm: sh Not tainted 6.11.0-rc6-next-20240902-dirty #1496
  [   44.505503] Hardware name: IBM 3931 A01 704 (z/VM 7.3.0)
  [   44.505508] Call Trace:
  [   44.505511]  [<000b0324d2f78080>] dump_stack_lvl+0xd0/0x108
  [   44.505521]  [<000b0324d2f5435c>] print_address_description.constprop.0+0x34/0x2e0
  [   44.505529]  [<000b0324d2f5464c>] print_report+0x44/0x138
  [   44.505536]  [<000b0324d1383192>] kasan_report+0xc2/0x140
  [   44.505543]  [<000b0324d2f52904>] special_mapping_close+0x9c/0xc8
  [   44.505550]  [<000b0324d12c7978>] remove_vma+0x78/0x120
  [   44.505557]  [<000b0324d128a2c6>] exit_mmap+0x326/0x750
  [   44.505563]  [<000b0324d0ba655a>] __mmput+0x9a/0x370
  [   44.505570]  [<000b0324d0bbfbe0>] exit_mm+0x240/0x340
  [   44.505575]  [<000b0324d0bc0228>] do_exit+0x548/0xd70
  [   44.505580]  [<000b0324d0bc1102>] do_group_exit+0x132/0x390
  [   44.505586]  [<000b0324d0bc13b6>] __s390x_sys_exit_group+0x56/0x60
  [   44.505592]  [<000b0324d0adcbd6>] do_syscall+0x2f6/0x430
  [   44.505599]  [<000b0324d2f78434>] __do_syscall+0xa4/0x170
  [   44.505606]  [<000b0324d2f9454c>] system_call+0x74/0x98
  [   44.505614]
  [   44.505616] Allocated by task 1384:
  [   44.505621]  kasan_save_stack+0x40/0x70
  [   44.505630]  kasan_save_track+0x28/0x40
  [   44.505636]  __kasan_kmalloc+0xa0/0xc0
  [   44.505642]  __create_xol_area+0xfa/0x410
  [   44.505648]  get_xol_area+0xb0/0xf0
  [   44.505652]  uprobe_notify_resume+0x27a/0x470
  [   44.505657]  irqentry_exit_to_user_mode+0x15e/0x1d0
  [   44.505664]  pgm_check_handler+0x122/0x170
  [   44.505670]
  [   44.505672] Freed by task 1384:
  [   44.505676]  kasan_save_stack+0x40/0x70
  [   44.505682]  kasan_save_track+0x28/0x40
  [   44.505687]  kasan_save_free_info+0x4a/0x70
  [   44.505693]  __kasan_slab_free+0x5a/0x70
  [   44.505698]  kfree+0xe8/0x3f0
  [   44.505704]  __mmput+0x20/0x370
  [   44.505709]  exit_mm+0x240/0x340
  [   44.505713]  do_exit+0x548/0xd70
  [   44.505718]  do_group_exit+0x132/0x390
  [   44.505722]  __s390x_sys_exit_group+0x56/0x60
  [   44.505727]  do_syscall+0x2f6/0x430
  [   44.505732]  __do_syscall+0xa4/0x170
  [   44.505738]  system_call+0x74/0x98

The problem is that uprobe_clear_state() kfree's struct xol_area, which contains struct vm_special_mapping *xol_mapping. This one is passed to _install_special_mapping() in xol_add_vma().

__mmput() reads:

  static inline void __mmput(struct mm_struct *mm)
  {
          VM_BUG_ON(atomic_read(&mm->mm_users));

          uprobe_clear_state(mm);
          exit_aio(mm);
          ksm_exit(mm);
          khugepaged_exit(mm); /* must run before exit_mmap */
          exit_mmap(mm);
          ...
  }

So uprobe_clear_state() in the beginning frees the memory area containing the vm_special_mapping data, but exit_mmap() uses this address later via vma->vm_private_data (which was set in _install_special_mapping()).

Fix this by moving uprobe_clear_state() to uprobes.c and using it as the close() callback.
[[email protected]: remove unneeded condition]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Fixes: 223febc6e557 ("mm: add optional close() to struct vm_special_mapping")
Signed-off-by: Sven Schnelle <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Cc: Adrian Hunter <[email protected]>
Cc: Alexander Shishkin <[email protected]>
Cc: Arnaldo Carvalho de Melo <[email protected]>
Cc: Ian Rogers <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kan Liang <[email protected]>
Cc: Mark Rutland <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Namhyung Kim <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>