path: root/include
2023-10-05  USB: dma: remove unused function prototype  (Randy Li, 1 file, -16/+0)
usb_buffer_map_sg() and usb_buffer_unmap_sg() have no definition since the beginning of v5.4. The rest are gone from 2.6.12. Signed-off-by: Randy Li <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-10-05  xfrm: Annotate struct xfrm_sec_ctx with __counted_by  (Kees Cook, 1 file, -1/+2)
Prepare for the coming implementation by GCC and Clang of the __counted_by attribute. Flexible array members annotated with __counted_by can have their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family functions). As found with Coccinelle[1], add __counted_by for struct xfrm_sec_ctx. Cc: Steffen Klassert <[email protected]> Cc: Herbert Xu <[email protected]> Cc: "David S. Miller" <[email protected]> Cc: [email protected] Link: https://github.com/kees/kernel-tools/blob/trunk/coccinelle/examples/counted_by.cocci [1] Signed-off-by: Kees Cook <[email protected]> Reviewed-by: Gustavo A. R. Silva <[email protected]> Signed-off-by: Steffen Klassert <[email protected]>
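As a rough illustration of the annotation pattern described above (the struct and field names here are made up for the sketch, not the actual xfrm_sec_ctx layout):

    /*
     * Hedged sketch of a __counted_by() annotation: "len" holds the number of
     * elements in data[], so UBSAN/FORTIFY can bounds-check accesses to it.
     */
    struct example_blob {
            u16  len;                       /* element count of data[] */
            char data[] __counted_by(len);  /* flexible array, now bounds-checkable */
    };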
2023-10-04  f2fs: Support Block Size == Page Size  (Daniel Rosenberg, 1 file, -28/+41)
This allows f2fs to support cases where the block size = page size for both 4K and 16K block sizes. Other sizes should work as well, should the need arise. This does not currently support 4K Block size filesystems if the page size is 16K. Signed-off-by: Daniel Rosenberg <[email protected]> Signed-off-by: Jaegeuk Kim <[email protected]>
2023-10-04  tcp: fix quick-ack counting to count actual ACKs of new data  (Neal Cardwell, 1 file, -2/+4)
This commit fixes quick-ack counting so that it only considers that a quick-ack has been provided if we are sending an ACK that newly acknowledges data. The code was erroneously using the number of data segments in outgoing skbs when deciding how many quick-ack credits to remove. This logic does not make sense, and could cause poor performance in request-response workloads, like RPC traffic, where requests or responses can be multi-segment skbs. When a TCP connection decides to send N quick-acks, that is to accelerate the cwnd growth of the congestion control module controlling the remote endpoint of the TCP connection. That quick-ack decision is purely about the incoming data and outgoing ACKs. It has nothing to do with the outgoing data or the size of outgoing data. And in particular, an ACK only serves the intended purpose of allowing the remote congestion control to grow the congestion window quickly if the ACK is ACKing or SACKing new data. The fix is simple: only count packets as serving the goal of the quickack mechanism if they are ACKing/SACKing new data. We can tell whether this is the case by checking inet_csk_ack_scheduled(), since we schedule an ACK exactly when we are ACKing/SACKing new data. Fixes: fc6415bcb0f5 ("[TCP]: Fix quick-ack decrementing with TSO.") Signed-off-by: Neal Cardwell <[email protected]> Reviewed-by: Yuchung Cheng <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-10-04  Merge tag 'nf-23-10-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf  (Jakub Kicinski, 1 file, -0/+1)
Florian Westphal says: ==================== netfilter patches for net First patch resolves a regression with vlan header matching, this was broken since 6.5 release. From myself. Second patch fixes an ancient problem with sctp connection tracking in case INIT_ACK packets are delayed. This comes with a selftest, both patches from Xin Long. Patch 4 extends the existing nftables audit selftest, from Phil Sutter. Patch 5, also from Phil, avoids a situation where nftables would emit an audit record twice. This was broken since 5.13 days. Patch 6, from myself, avoids spurious insertion failure if we encounter an overlapping but expired range during element insertion with the 'nft_set_rbtree' backend. This problem exists since 6.2. * tag 'nf-23-10-04' of https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: nft_set_rbtree: fix spurious insertion failure netfilter: nf_tables: Deduplicate nft_register_obj audit logs selftests: netfilter: Extend nft_audit.sh selftests: netfilter: test for sctp collision processing in nf_conntrack netfilter: handle the connecting collision properly in nf_conntrack_proto_sctp netfilter: nft_payload: rebuild vlan header on h_proto access ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-10-04  dt-bindings: power: Update prefixes for AON power domain  (Changhuang Liang, 1 file, -2/+3)
Use "JH7110_AON_PD_" prefix for AON power domain for JH7110 SoC. Reviewed-by: Walker Chen <[email protected]> Signed-off-by: Changhuang Liang <[email protected]> Acked-by: Conor Dooley <[email protected]> Reviewed-by: Geert Uytterhoeven <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Ulf Hansson <[email protected]>
2023-10-04  page_pool: fix documentation typos  (Randy Dunlap, 1 file, -3/+3)
Correct grammar for better readability. Signed-off-by: Randy Dunlap <[email protected]> Cc: Jesper Dangaard Brouer <[email protected]> Reviewed-by: Simon Horman <[email protected]> Acked-by: Ilias Apalodimas <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-10-04  eventfs: Remove eventfs_file and just use eventfs_inode  (Steven Rostedt (Google), 2 files, -16/+15)
Instead of having a descriptor for every file represented in the eventfs directory, only have the directory itself represented. Change the API to send in a list of entries that represent all the files in the directory (but not other directories). The entry list contains a name and a callback function that will be used to create the files when they are accessed. struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry *parent, const struct eventfs_entry *entries, int size, void *data); is used for the top level eventfs directory, and returns an eventfs_inode that will be used by: struct eventfs_inode *eventfs_create_dir(const char *name, struct eventfs_inode *parent, const struct eventfs_entry *entries, int size, void *data); where both of the above take an array of struct eventfs_entry entries for every file that is in the directory. The entries are defined by: typedef int (*eventfs_callback)(const char *name, umode_t *mode, void **data, const struct file_operations **fops); struct eventfs_entry { const char *name; eventfs_callback callback; }; Where the name is the name of the file and the callback gets called when the file is being created. The callback passes in the name (in case the same callback is used for multiple files), a pointer to the mode, data and fops. The data will be pointing to the data that was passed in eventfs_create_dir() or eventfs_create_events_dir() but may be overridden to point to something else, as it will be used to point to the inode->i_private that is created. The information passed back from the callback is used to create the dentry/inode. If the callback fills the data and the file should be created, it must return a positive number. On zero or negative, the file is ignored. This logic may also be used as a prototype to convert entire pseudo file systems into just-in-time allocation. The "show_events_dentry" file has been updated to show the directories, and any files they have. 
With just the eventfs_file allocations:

Before/after deltas for meminfo (in kB):

    MemFree: -14360
    MemAvailable: -14260
    Buffers: 40
    Cached: 24
    Active: 44
    Inactive: 48
    Inactive(anon): 28
    Active(file): 44
    Inactive(file): 20
    Dirty: -4
    AnonPages: 28
    Mapped: 4
    KReclaimable: 132
    Slab: 1604
    SReclaimable: 132
    SUnreclaim: 1472
    Committed_AS: 12

Before/after deltas for slabinfo (<slab>: <objects> [ * <size> = <total>]):

    ext4_inode_cache 27 [* 1184 = 31968 ]
    extent_status 102 [* 40 = 4080 ]
    tracefs_inode_cache 144 [* 656 = 94464 ]
    buffer_head 39 [* 104 = 4056 ]
    shmem_inode_cache 49 [* 800 = 39200 ]
    filp -53 [* 256 = -13568 ]
    dentry 251 [* 192 = 48192 ]
    lsm_file_cache 277 [* 32 = 8864 ]
    vm_area_struct -14 [* 184 = -2576 ]
    trace_event_file 1748 [* 88 = 153824 ]
    kmalloc-1k 35 [* 1024 = 35840 ]
    kmalloc-256 49 [* 256 = 12544 ]
    kmalloc-192 -28 [* 192 = -5376 ]
    kmalloc-128 -30 [* 128 = -3840 ]
    kmalloc-96 10581 [* 96 = 1015776 ]
    kmalloc-64 3056 [* 64 = 195584 ]
    kmalloc-32 1291 [* 32 = 41312 ]
    kmalloc-16 2310 [* 16 = 36960 ]
    kmalloc-8 9216 [* 8 = 73728 ]

Free memory dropped by 14,360 kB
Available memory dropped by 14,260 kB
Total slab additions in size: 1,771,032 bytes

With this change:

Before/after deltas for meminfo (in kB):

    MemFree: -12084
    MemAvailable: -11976
    Buffers: 32
    Cached: 32
    Active: 72
    Inactive: 168
    Inactive(anon): 176
    Active(file): 72
    Inactive(file): -8
    Dirty: 24
    AnonPages: 196
    Mapped: 8
    KReclaimable: 148
    Slab: 836
    SReclaimable: 148
    SUnreclaim: 688
    Committed_AS: 324

Before/after deltas for slabinfo (<slab>: <objects> [ * <size> = <total>]):

    tracefs_inode_cache 144 [* 656 = 94464 ]
    shmem_inode_cache -23 [* 800 = -18400 ]
    filp -92 [* 256 = -23552 ]
    dentry 179 [* 192 = 34368 ]
    lsm_file_cache -3 [* 32 = -96 ]
    vm_area_struct -13 [* 184 = -2392 ]
    trace_event_file 1748 [* 88 = 153824 ]
    kmalloc-1k -49 [* 1024 = -50176 ]
    kmalloc-256 -27 [* 256 = -6912 ]
    kmalloc-128 1864 [* 128 = 238592 ]
    kmalloc-64 4685 [* 64 = 299840 ]
    kmalloc-32 -72 [* 32 = -2304 ]
    kmalloc-16 256 [* 16 = 4096 ]

    total = 721352

Free memory dropped by 12,084 kB
Available memory dropped by 11,976 kB
Total slab additions in size: 721,352 bytes

That's over 2 MB in savings per instance for free and available memory, and over 1 MB in savings per instance of slab memory. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Andrew Morton <[email protected]> Cc: Ajay Kaher <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
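A rough usage sketch of the API described in this entry; the entry names, data pointers and the example_enable_fops operations below are illustrative, not taken from the actual tracing code:

    static int example_callback(const char *name, umode_t *mode, void **data,
                                const struct file_operations **fops)
    {
        /* Called when the file is first accessed; fill in mode/data/fops. */
        if (strcmp(name, "enable") == 0) {
            *mode = 0644;
            *fops = &example_enable_fops;   /* assumed fops, defined elsewhere */
            return 1;                       /* positive: create the file */
        }
        return 0;                           /* zero or negative: ignore the file */
    }

    static const struct eventfs_entry example_entries[] = {
        { .name = "enable", .callback = example_callback },
        { .name = "filter", .callback = example_callback },
    };

    /* Create the directory; the files are materialized lazily on access. */
    ei = eventfs_create_dir("example", parent_ei, example_entries,
                            ARRAY_SIZE(example_entries), my_data);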
2023-10-04  rcu: Standardize explicit CPU-hotplug calls  (Frederic Weisbecker, 4 files, -5/+8)
rcu_report_dead() and rcutree_migrate_callbacks() have their headers in rcupdate.h while those are pure rcutree calls, like the other CPU-hotplug functions. Also rcu_cpu_starting() and rcu_report_dead() have different naming conventions while they mirror each other's effects. Fix the headers and propose a naming that relates both functions and aligns with the prefix of other rcutree CPU-hotplug functions. Reviewed-by: Paul E. McKenney <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
2023-10-04  rcu: Conditionally build CPU-hotplug teardown callbacks  (Frederic Weisbecker, 1 file, -2/+9)
Among the three CPU-hotplug teardown RCU callbacks, two of them early exit if CONFIG_HOTPLUG_CPU=n, and one is left unchanged. In any case all of them have an implementation when CONFIG_HOTPLUG_CPU=n. Align instead with the common way to deal with CPU-hotplug teardown callbacks and provide a proper stub when they are not supported. Reviewed-by: Paul E. McKenney <[email protected]> Signed-off-by: Frederic Weisbecker <[email protected]>
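The usual header convention for such a stub looks roughly like this; a sketch using rcutree_dead_cpu() as an example of one of the teardown callbacks, not the exact hunk:

    #ifdef CONFIG_HOTPLUG_CPU
    int rcutree_dead_cpu(unsigned int cpu);
    #else
    /* No CPU hotplug: the teardown callback compiles away to a no-op. */
    static inline int rcutree_dead_cpu(unsigned int cpu) { return 0; }
    #endif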
2023-10-04  RDMA/core: Fix a couple of obvious typos in comments  (Chuck Lever, 1 file, -1/+1)
Fix typos. Signed-off-by: Chuck Lever <[email protected]> Link: https://lore.kernel.org/r/169643338101.8035.6826446669479247727.stgit@manet.1015granger.net Signed-off-by: Leon Romanovsky <[email protected]>
2023-10-04  net: appletalk: remove cops support  (Greg Kroah-Hartman, 1 file, -1/+0)
The COPS Appletalk support is very old, was never said to actually work properly, and the firmware code for the devices is under a very suspect license. Remove it all to clear up the license issue; if it is still needed and actually used by anyone, we can add it back later once the license is cleared up. Reported-by: Prarit Bhargava <[email protected]> Cc: [email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]> Acked-by: Christoph Hellwig <[email protected]> Acked-by: Prarit Bhargava <[email protected]> Reviewed-by: Vitaly Kuznetsov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-10-04  Merge tag 'wireless-2023-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless  (Jakub Kicinski, 1 file, -2/+4)
Johannes Berg says: ==================== Quite a collection of fixes this time, really too many to list individually. Many stack fixes, even rfkill (found by simulation and the new eevdf scheduler)! Also a bigger maintainers file cleanup, to remove old and redundant information. * tag 'wireless-2023-09-27' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: (32 commits) wifi: iwlwifi: mvm: Fix incorrect usage of scan API wifi: mac80211: Create resources for disabled links wifi: cfg80211: avoid leaking stack data into trace wifi: mac80211: allow transmitting EAPOL frames with tainted key wifi: mac80211: work around Cisco AP 9115 VHT MPDU length wifi: cfg80211: Fix 6GHz scan configuration wifi: mac80211: fix potential key leak wifi: mac80211: fix potential key use-after-free wifi: mt76: mt76x02: fix MT76x0 external LNA gain handling wifi: brcmfmac: Replace 1-element arrays with flexible arrays wifi: mwifiex: Fix oob check condition in mwifiex_process_rx_packet wifi: rtw88: rtw8723d: Fix MAC address offset in EEPROM rfkill: sync before userspace visibility/changes wifi: mac80211: fix mesh id corruption on 32 bit systems wifi: cfg80211: add missing kernel-doc for cqm_rssi_work wifi: cfg80211: fix cqm_config access race wifi: iwlwifi: mvm: Fix a memory corruption issue wifi: iwlwifi: Ensure ack flag is properly cleared. wifi: iwlwifi: dbg_ini: fix structure packing iwlwifi: mvm: handle PS changes in vif_cfg_changed ... ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-10-04  IPsec packet offload support in multiport RoCE devices  (Leon Romanovsky, 34 files, -117/+177)
This series from Patrisious extends mlx5 to support IPsec packet offload in multiport devices (MPV, see [1] for more details). These devices have a single flow steering logic and two netdev interfaces, which requires extra logic to manage IPsec configurations as they are performed on netdevs. Thanks [1] https://lore.kernel.org/linux-rdma/[email protected]/ Link: https://lore.kernel.org/all/[email protected] Signed-off-by: Leon Romanovsky <[email protected]> * mlx5-next: (576 commits) net/mlx5: Handle IPsec steering upon master unbind/bind net/mlx5: Configure IPsec steering for ingress RoCEv2 MPV traffic net/mlx5: Configure IPsec steering for egress RoCEv2 MPV traffic net/mlx5: Add create alias flow table function to ipsec roce net/mlx5: Implement alias object allow and create functions net/mlx5: Add alias flow table bits net/mlx5: Store devcom pointer inside IPsec RoCE net/mlx5: Register mlx5e priv to devcom in MPV mode RDMA/mlx5: Send events from IB driver about device affiliation state net/mlx5: Introduce ifc bits for migration in a chunk mode Linux 6.6-rc3 ...
2023-10-04  crash_core.c: remove unneeded functions  (Baoquan He, 1 file, -4/+0)
So far, nobody calls the functions parse_crashkernel_high() and parse_crashkernel_low(), so remove both of them. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Reviewed-by: Zhen Lei <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chen Jiahao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  crash_core: move crashk_*res definition into crash_core.c  (Baoquan He, 2 files, -4/+8)
Both crashk_res and crashk_low_res are used to mark the reserved crashkernel regions in the iomem_resource tree. And later the generic crashkernel reservation will be added into crash_core.c. So move the crashk_res and crashk_low_res definitions into crash_core.c to avoid a compile error if CONFIG_CRASH_CORE=on while CONFIG_KEXEC_CORE is unset. Meanwhile include <asm/crash_core.h> in <linux/crash_core.h> if generic reservation is needed. In that case, <asm/crash_core.h> needs to be added by the ARCH. In asm/crash_core.h, the ARCH can provide its own macro definitions to override macros in <linux/crash_core.h> if needed. Wrap the include in a CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION ifdeffery scope to avoid compile errors on other ARCH-es which don't take the generic reservation way yet. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Reviewed-by: Zhen Lei <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chen Jiahao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
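Based on the description above, the wrapped include in <linux/crash_core.h> boils down to something like this sketch:

    #ifdef CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
    /* Architectures using the generic reservation supply this header. */
    #include <asm/crash_core.h>
    #endif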
2023-10-04  crash_core: add generic function to do reservation  (Baoquan He, 1 file, -0/+28)
Architectures like x86_64, arm64 and riscv have a vast virtual address space and usually huge physical RAM. Their crashkernel reservation doesn't have to be limited under 4G RAM, but can be extended to the whole physical memory via crashkernel=,high support. Now add a function reserve_crashkernel_generic() to reserve crashkernel memory if users specify any case of the kernel parameters, like crashkernel=xM[@offset] or crashkernel=,high|low. This is preparation to simplify the code of crashkernel=,high support in architectures. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Reviewed-by: Zhen Lei <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chen Jiahao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  crash_core: change parse_crashkernel() to support crashkernel=,high|low parsing  (Baoquan He, 1 file, -0/+6)
Now parse_crashkernel() is a real entry point for all kinds of crashkernel parsing on any architecture. And wrap the crashkernel=,high|low handling inside the CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION ifdeffery scope. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Reviewed-by: Zhen Lei <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chen Jiahao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  crash_core: change the prototype of function parse_crashkernel()  (Baoquan He, 1 file, -1/+2)
Add two parameters, 'low_size' and 'high', to the function parse_crashkernel(); crashkernel=,high|low parsing will be added later. Make adjustments in all call sites of parse_crashkernel() in arch code. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Baoquan He <[email protected]> Reviewed-by: Zhen Lei <[email protected]> Cc: Catalin Marinas <[email protected]> Cc: Chen Jiahao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
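A hedged sketch of what the resulting prototype and a call site might look like; the exact parameter types and the memblock helper used here are assumptions, not copied from the patch:

    int parse_crashkernel(char *cmdline, unsigned long long system_ram,
                          unsigned long long *crash_size,
                          unsigned long long *crash_base,
                          unsigned long long *low_size, bool *high);

    /* Possible arch call site (sketch): */
    ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
                            &crash_size, &crash_base, &low_size, &high);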
2023-10-04  minmax: fix header inclusions  (Andy Shevchenko, 1 file, -1/+2)
BUILD_BUG_ON*() macros are defined in build_bug.h. Include it. Replace compiler_types.h by compiler.h, which provides the former, to have a definition of __UNIQUE_ID(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Andy Shevchenko <[email protected]> Reviewed-by: Herve Codina <[email protected]> Cc: Rasmus Villemoes <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  minmax: deduplicate __unconst_integer_typeof()  (Andy Shevchenko, 1 file, -23/+3)
It appears that compiler_types.h already has an implementation of __unconst_integer_typeof(), called __unqual_scalar_typeof(). Use it instead of the copy. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Andy Shevchenko <[email protected]> Acked-by: Herve Codina <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  kthread: add kthread_stop_put  (Andreas Gruenbacher, 1 file, -0/+1)
Add a kthread_stop_put() helper that stops a thread and puts its task struct. Use it to replace the various instances of kthread_stop() followed by put_task_struct(). Remove the kthread_stop_put() macro in usbip that is similar but doesn't return the result of kthread_stop(). [[email protected]: fix kerneldoc comment] Link: https://lkml.kernel.org/r/[email protected] [[email protected]: document kthread_stop_put()'s argument] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Andreas Gruenbacher <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
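A minimal sketch of what such a helper amounts to, going only by the description above (the real function may differ in detail):

    /* Stop a kthread and drop the reference on its task_struct. */
    int kthread_stop_put(struct task_struct *k)
    {
        int ret = kthread_stop(k);

        put_task_struct(k);
        return ret;
    }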
2023-10-04  seq_file: add helper macro to define attribute for rw file  (Xingui Yang, 1 file, -0/+15)
Patch series "Add helper macro DEFINE_SHOW_STORE_ATTRIBUTE() at seq_file.c", v6. We already have the DEFINE_SHOW_ATTRIBUTE() helper macro for defining an attribute for a read-only file, but we found that many drivers also want a helper macro for a read-write file. So we add this helper macro to reduce duplicated code. This patch (of 3): We already have the DEFINE_SHOW_ATTRIBUTE() helper macro for defining an attribute for a read-only file, but many drivers want a helper macro for a read-write file too. So we add the DEFINE_SHOW_STORE_ATTRIBUTE() helper to reduce duplicated code. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Luo Jiaxing <[email protected]> Co-developed-by: Xingui Yang <[email protected]> Signed-off-by: Xingui Yang <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> Cc: Al Viro <[email protected]> Cc: Animesh Manna <[email protected]> Cc: Anshuman Gupta <[email protected]> Cc: Damien Le Moal <[email protected]> Cc: Felipe Balbi <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Himanshu Madhani <[email protected]> Cc: James Bottomley <[email protected]> Cc: John Garry <[email protected]> Cc: Martin K. Petersen <[email protected]> Cc: Uma Shankar <[email protected]> Cc: Xiang Chen <[email protected]> Cc: Zeng Tao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
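A hedged usage sketch, assuming the new macro follows the same naming convention as DEFINE_SHOW_ATTRIBUTE() (i.e. it wires together <name>_show and <name>_write); the foo_ctl names and foo_ctl_val state are illustrative:

    static int foo_ctl_show(struct seq_file *s, void *unused)
    {
        seq_printf(s, "%d\n", foo_ctl_val);   /* foo_ctl_val: assumed driver state */
        return 0;
    }

    static ssize_t foo_ctl_write(struct file *file, const char __user *buf,
                                 size_t count, loff_t *ppos)
    {
        /* parse the user buffer and update foo_ctl_val here */
        return count;
    }

    /* One macro instead of hand-rolled open/read/write file_operations. */
    DEFINE_SHOW_STORE_ATTRIBUTE(foo_ctl);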
2023-10-04  kill task_struct->thread_group  (Oleg Nesterov, 1 file, -1/+0)
The last user was removed by the previous patch. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  change thread_group_empty() to use task_struct->thread_node  (Oleg Nesterov, 1 file, -1/+2)
Patch series "kill task_struct->thread_group". This patch (of 2): It could use list_is_singular() but this way it is cheaper. Plus the thread_group_leader() check makes it clear that thread_group_empty() can only return true if p is a group leader. This was not immediately obvious before this patch. task_struct->thread_group no longer has users, it can die. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  change next_thread() to use __next_thread() ?: group_leader  (Oleg Nesterov, 1 file, -3/+2)
This relies on the fact that the group leader is always the 1st entry in the signal->thread_head list. With or without this change, if the lockless next_thread(last_thread) races with exec it can return the old or the new leader. We are almost ready to kill task->thread_group; after this change its only user is thread_group_empty(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
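Per the subject line, the new next_thread() is essentially the following sketch:

    static inline struct task_struct *next_thread(struct task_struct *p)
    {
        /* Fall back to the group leader once __next_thread() hits the list end. */
        return __next_thread(p) ?: p->group_leader;
    }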
2023-10-04  introduce __next_thread(), fix next_tid() vs exec() race  (Oleg Nesterov, 1 file, -0/+11)
Patch series "introduce __next_thread(), change next_thread()". After commit dce8f8ed1de1 ("document while_each_thread(), change first_tid() to use for_each_thread()") + this series 1. We have only one lockless user of next_thread(), task_group_seq_get_next(). I think it should be changed too. 2. We have only one user of task_struct->thread_group, thread_group_empty(). The next patches will change thread_group_empty() and kill ->thread_group. This patch (of 2): next_tid(start) does: rcu_read_lock(); if (pid_alive(start)) { pos = next_thread(start); if (thread_group_leader(pos)) pos = NULL; else get_task_struct(pos); it should return pos = NULL when next_thread() wraps to the 1st thread in the thread group, group leader, and the thread_group_leader() check tries to detect this case. But this can race with exec. To simplify, suppose we have a main thread M and a single sub-thread T, next_tid(T) should return NULL. Now suppose that T execs. If next_tid(T) is called after T changes the leadership and before it does release_task() which removes the old leader from list, then next_thread() returns M and thread_group_leader(M) = F. Lockless use of next_thread() should be avoided. After this change only task_group_seq_get_next() does this, and I believe it should be changed as well. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Cc: Eric W. Biederman <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
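A hedged sketch of what __next_thread() looks like given the description: it walks signal->thread_head and returns NULL at the end of the list instead of wrapping around to the group leader.

    static inline struct task_struct *__next_thread(struct task_struct *p)
    {
        /* NULL once the thread list is exhausted; no wrap to the leader. */
        return list_next_or_null_rcu(&p->signal->thread_head, &p->thread_node,
                                     struct task_struct, thread_node);
    }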
2023-10-04  compiler.h: unify __UNIQUE_ID  (Nick Desaulniers, 3 files, -11/+1)
commit 6f33d58794ef ("__UNIQUE_ID()") added a fallback definition of __UNIQUE_ID because gcc 4.2 and older did not support __COUNTER__. Also, this commit is effectively a revert of commit b41c29b0527c ("Kbuild: provide a __UNIQUE_ID for clang") which mentions clang 2.6+ supporting __COUNTER__. Documentation/process/changes.rst currently lists the minimum supported version of these compilers as: - gcc: 5.1 - clang: 11.0.0 It should be safe to say that __COUNTER__ is well supported by this point. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Nick Desaulniers <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Jan Beulich <[email protected]> Cc: Luc Van Oostenryck <[email protected]> Cc: Michal Marek <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Paul Russel <[email protected]> Cc: Tom Rix <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
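The surviving definition is essentially the __COUNTER__-based one (sketch):

    /* Paste the prefix together with a translation-unit-unique counter value. */
    #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__)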
2023-10-04  mm/damon/core: implement scheme-specific apply interval  (SeongJae Park, 1 file, -2/+15)
DAMON-based operation schemes are applied for every aggregation interval. That was mainly because schemes were using nr_accesses, which becomes complete (ready to be used) only at each aggregation interval. However, the schemes are now using nr_accesses_bp, which is updated for each sampling interval in a way that makes it reasonable to use. Therefore, there is no reason to apply schemes only once per aggregation interval. The unnecessary alignment with the aggregation interval was also making some use cases of DAMOS tricky. Setting quotas under a long aggregation interval is one such example. Suppose the aggregation interval is ten seconds, and there is a scheme having a CPU quota of 100ms per 1s. The scheme will actually use 100ms per ten seconds, since it cannot be applied before the next aggregation interval. The feature is working as intended, but the results might not be that intuitive for some users. This could be fixed by updating the quota to 1s per 10s. But, in that case, the CPU usage of DAMOS could look like spikes, and could actually have a bad effect on other CPU-sensitive workloads. Implement a dedicated timing interval for each DAMON-based operation scheme, namely apply_interval. The interval will be sampling-interval aligned, and each scheme will be applied for its apply_interval. The interval is set to 0 by default, which means the scheme should use the aggregation interval instead. This avoids any behavioral difference for old users. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  mm/damon/core: use nr_accesses_bp as a source of damos_before_apply tracepoint  (SeongJae Park, 1 file, -1/+1)
The damos_before_apply tracepoint exposes the access rate of DAMON regions using the nr_accesses field of the regions, which was what DAMOS actually used in the past. However, DAMOS has changed to use nr_accesses_bp instead. Update the tracepoint to expose the value that DAMOS is really using. Note that it doesn't expose the value as-is in basis points, but after converting it to the natural number by dividing it by 10,000. Therefore this change doesn't make user-visible behavioral differences. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Steven Rostedt (Google) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  memblock: introduce MEMBLOCK_RSRV_NOINIT flag  (Usama Arif, 1 file, -0/+9)
For reserved memory regions marked with this flag, reserve_bootmem_region is not called during memmap_init_reserved_pages. This can be used to avoid struct page initialization for regions which won't need them, for e.g. hugepages with Hugepage Vmemmap Optimization enabled. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Usama Arif <[email protected]> Acked-by: Muchun Song <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Cc: Fam Zheng <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Punit Agrawal <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  mm/damon/core: mark damon_moving_sum() as a static function  (SeongJae Park, 1 file, -2/+0)
The function is used by only mm/damon/core.c. Mark it as a static function. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  mm/damon/core: use pseudo-moving sum for nr_accesses_bp  (SeongJae Park, 1 file, -3/+9)
Let nr_accesses_bp be calculated as a pseudo-moving sum that is updated for every sampling interval, using damon_moving_sum(). This is assumed to be useful for cases where the aggregation interval is set quite large, but the monitoring results need to be collected before the next aggregation interval has passed. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  mm/damon/core: introduce nr_accesses_bp  (SeongJae Park, 1 file, -0/+5)
Add yet another representation of the access rate of each region, namely nr_accesses_bp. It is just the same as nr_accesses but represents the value in basis points (1 in 10,000), and is updated at once in every aggregation interval. That is, moving_accesses_bp is just nr_accesses * 10000. This may seem useless at the moment. However, it will be useful for representing less-than-one nr_accesses values, which will be needed to make the moving sum-based nr_accesses. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  mm/damon/core: implement a pseudo-moving sum function  (SeongJae Park, 1 file, -0/+2)
For values that continuously change, a moving average or sum is a good way to provide fast updates while handling temporal and erroneous variability of the value. For example, the access rate counter (nr_accesses) is calculated as a sum of the number of positive sampled access check results collected during a discrete time window (the aggregation interval), and hence it handles temporal and erroneous access check results, but provides the update only for every aggregation interval. Using a moving sum method for that could allow providing the value for every sampling interval. That could be useful for getting a monitoring results snapshot or running DAMOS in fine-grained timing. However, supporting the moving sum for cases where the number of samples in the time window is arbitrary could impose high overhead, since the number of past values that it needs to keep could be too high. The nr_accesses would also be one of those cases. To mitigate the overhead, implement a pseudo-moving sum function that only provides an estimated pseudo-moving sum. It assumes there was no error in the last discrete time window and subtracts a constant portion of the last discrete time window's sum. Note that the function is not strictly implementing the moving sum, but it keeps a property of the moving sum, which makes the value the same as the discrete-window based sum at each time window-aligned timing. Hence, people collecting the value at the old timings would see no difference. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
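A hedged sketch of the arithmetic described above; the function and parameter names are illustrative, not necessarily those of the actual damon_moving_sum(). The update drops a constant 1/len share of the last complete discrete-window sum before adding the new sample:

    /*
     * mvsum:   current pseudo-moving sum
     * nomvsum: sum of the last complete discrete time window
     * len:     number of samples per time window
     * new:     newly sampled value
     */
    static unsigned int pseudo_moving_sum(unsigned int mvsum, unsigned int nomvsum,
                                          unsigned int len, unsigned int new)
    {
        return mvsum - nomvsum / len + new;
    }

At window-aligned times this converges to the plain discrete-window sum, which is the property the commit message calls out.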
2023-10-04  mm/damon/core: define and use a dedicated function for region access rate update  (SeongJae Park, 1 file, -1/+4)
Patch series "mm/damon: provide pseudo-moving sum based access rate". DAMON checks the access to each region for every sampling interval, and increases the access rate counter of the region, namely nr_accesses, if the access was made. For every aggregation interval, the counter is reset. The counter is exposed to users to be used as a metric showing the relative access rate (frequency) of each region. In other words, DAMON provides the access rate of each region in every aggregation interval. The aggregation avoids temporal access pattern changes making things confusing. However, this also makes a few DAMON-related operations unnecessarily need to be aligned to the aggregation interval. This can restrict the flexibility of DAMON applications, especially when the aggregation interval is huge. To provide the monitoring results in finer-grained timing while keeping handling of temporal access pattern change, this patchset implements a pseudo-moving sum based access rate metric. It is a pseudo-moving sum because a strict moving sum implementation would need to keep all values for the last time window, and that could incur high overhead since there could be an arbitrary number of values in a time window. Especially in the case of nr_accesses, since the sampling interval and aggregation interval can be arbitrarily set and the past values should be maintained for every region, it could be risky. The pseudo-moving sum assumes there was no temporal access pattern change in the last discrete time window to remove the need for keeping the list of the last time window values. As a result, it is not a strict moving sum implementation, but provides reasonable accuracy. Also, it keeps an important property of the moving sum. That is, the moving sum becomes the same as the discrete-window based sum at times that align to the time window. This means using the pseudo-moving sum based nr_accesses makes no change for users who read the value for every aggregation interval. Patches Sequence ---------------- The sequence of the patches is as follows. The first four patches are for preparation of the change. The first two (patches 1 and 2) implement a helper function for the nr_accesses update and eliminate a corner case that skips use of the function, respectively. The following two (patches 3 and 4) respectively implement the pseudo-moving sum function and its simple unit test case. Two patches for making DAMON use the pseudo-moving sum follow. The fifth one (patch 5) introduces a new field for representing the pseudo-moving sum-based access rate of each region, and the sixth one makes the new representation actually be updated with the pseudo-moving sum function. The last two patches (patches 7 and 8) make followup fixes for skipping unnecessary updates and marking the moving sum function as static, respectively. This patch (of 8): Each DAMON operations set updates the nr_accesses field of each damon_region for each of its access check results, from the check_accesses() callback. Directly accessing the field could make things complex to manage and change in the future. Define and use a dedicated function for the purpose. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Cc: Brendan Higgins <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  mm/damon/core: use number of passed access sampling as a timer  (SeongJae Park, 1 file, -2/+12)
DAMON sleeps for the sampling interval after each sampling, and checks if the aggregation interval and the ops update interval have passed using ktime_get_coarse_ts64() and baseline timestamps for the intervals. That design is for making the operations occur at deterministic timing regardless of the time spent for each work. However, it turned out it is not that useful, and incurs not-that-intuitive results. After all, timer functions, and especially the sleep functions that DAMON uses to wait for specific timing, are not necessarily strictly accurate. That is legal behavior, so no problem. However, depending on such inaccuracies, nr_accesses can be larger than the aggregation interval divided by the sampling interval. For example, with the default setting (5 ms sampling interval and 100 ms aggregation interval) we frequently see regions having nr_accesses larger than 20. Also, if the execution of a DAMOS scheme takes a long time, the next aggregation could happen before enough samples are collected. This is not what usual users would intuitively expect. Since access check sampling is the smallest unit of work in DAMON, using the number of passed sampling intervals as the DAMON-internal timer can easily avoid these problems. That is, convert the aggregation and ops update intervals to the numbers of sampling intervals that need to pass before those operations are executed, count the number of passed sampling intervals, and invoke the operations as soon as the specified number of sampling intervals have passed. Make the change. Note that this could make a behavioral change for settings using intervals that are not aligned to the sampling interval. For example, if the sampling interval is 5 ms and the aggregation interval is 12 ms, DAMON effectively uses 15 ms as its aggregation interval, because it checks whether the aggregation interval has passed only after sleeping for the sampling interval. This change will make DAMON effectively use 10 ms as the aggregation interval, since it uses 'aggregation interval / sampling interval * sampling interval' as the effective aggregation interval, and we don't use floating point types. Usual users would have used aligned intervals, so this behavioral change is not expected to make any meaningful impact, so just make this change. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  mm, vmscan: remove ISOLATE_UNMAPPED  (Vlastimil Babka, 2 files, -8/+2)
This isolate_mode_t flag is effectively unused since 89f6c88a6ab4 ("mm: __isolate_lru_page_prepare() in isolate_migratepages_block()") as sc->may_unmap is now checked directly (and only node_reclaim has a mode that sets it to 0). The last remaining place is mm_vmscan_lru_isolate tracepoint for the isolate_mode parameter. That one was mainly used to indicate the active/inactive mode, which the trace-vmscan-postprocess.pl script consumed, but that got silently broken. After fixing the script by the previous patch, it does not need the isolate_mode anymore. So just remove the parameter and with that the whole ISOLATE_UNMAPPED flag. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Vlastimil Babka <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  buffer: remove __getblk_gfp()  (Matthew Wilcox (Oracle), 1 file, -2/+0)
Inline it into __bread_gfp(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Hui Zhu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  ext4: call bdev_getblk() from sb_getblk_gfp()  (Matthew Wilcox (Oracle), 1 file, -3/+3)
Most of the callers of sb_getblk_gfp() already assumed that they were passing the entire GFP flags to use. Fix up the two callers that didn't, and remove the __GFP_NOFAIL from them since they both appear to correctly handle failure. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Hui Zhu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  buffer: convert sb_getblk() to call __getblk()  (Matthew Wilcox (Oracle), 1 file, -4/+3)
Now that __getblk() is in the right place in the file, it is trivial to call it from sb_getblk(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Hui Zhu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  buffer: convert getblk_unmovable() and __getblk() to use bdev_getblk()  (Matthew Wilcox (Oracle), 1 file, -14/+22)
Move these two functions up in the file for the benefit of the next patch, and pass in all of the GFP flags to use instead of the partial GFP flags used by __getblk_gfp(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Hui Zhu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  buffer: hoist GFP flags from grow_dev_page() to __getblk_gfp()  (Matthew Wilcox (Oracle), 1 file, -0/+2)
grow_dev_page() is only called by grow_buffers(). grow_buffers() is only called by __getblk_slow() and __getblk_slow() is only called from __getblk_gfp(), so it is safe to move the GFP flags setting all the way up. With that done, add a new bdev_getblk() entry point that leaves the GFP flags the way the caller specified them. [[email protected]: fix grow_dev_page() error handling] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Hui Zhu <[email protected]> Cc: Dan Carpenter <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
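A hedged usage sketch of the new entry point, assuming a (bdev, block, size, gfp) signature as implied by the description; the bdev/block/blocksize locals and the chosen GFP mask are illustrative:

    /* The caller-specified GFP flags are passed through unchanged. */
    struct buffer_head *bh;

    bh = bdev_getblk(bdev, block, blocksize, GFP_NOWAIT | __GFP_NOWARN);
    if (!bh)
        return -ENOMEM;     /* or retry with a more forgiving gfp mask */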
2023-10-04  buffer: pass GFP flags to folio_alloc_buffers()  (Matthew Wilcox (Oracle), 1 file, -1/+1)
Patch series "Add and use bdev_getblk()", v2. This patch series fixes a bug reported by Hui Zhu; see proposed patches v1 and v2: https://lore.kernel.org/linux-fsdevel/[email protected]/ https://lore.kernel.org/linux-fsdevel/[email protected]/ I decided to go in a rather different direction for this fix, and fix a related problem at the same time. I don't think there's any urgency to rush this into Linus' tree, nor have I marked it for stable. Reasonable people may disagree. This patch (of 8): Instead of creating entirely new flags, inherit them from grow_dev_page(). The other callers create the same flags that this function used to create. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Hui Zhu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  mm/damon/core: add a tracepoint for damos apply target regions  (SeongJae Park, 1 file, -0/+39)
Patch series "mm/damon: add a tracepoint for damos apply target regions", v2. DAMON provides the damon_aggregated tracepoint to let users record full monitoring results. Sometimes, users need to record monitoring results of a specific pattern. The DAMOS tried regions directory of the DAMON sysfs interface allows it, but that interface is mainly designed for snapshots and therefore would be inefficient for such recording. Implement yet another tracepoint for efficient support of the usecase. This patch (of 2): DAMON provides the damon_aggregated tracepoint, which exposes details of each region and its access monitoring results. It is useful for getting whole monitoring results, e.g., for recording purposes. For investigations of DAMOS, the DAMON sysfs interface provides DAMOS statistics and the tried_regions directory. But those provide only statistics and snapshots. If the scheme is frequently applied and the user needs to know every detail of DAMOS behavior, the snapshot-based interface could be insufficient and expensive. As a last resort, userspace users need to record all the monitoring results via the damon_aggregated tracepoint and simulate how DAMOS would have worked. It is unnecessarily complicated. DAMON kernel API users, meanwhile, can do that easily via the before_damos_apply() callback field of 'struct damon_callback', though. Add a tracepoint that will be called just after the before_damos_apply() callback for more convenient investigations of DAMOS. The tracepoint exposes all details about each region, similar to the damon_aggregated tracepoint. Please note that DAMOS is currently not only for memory management but also for query-like efficient monitoring results retrievals (when the 'stat' action is used). Until now, only statistics or snapshots were supported. Addition of this tracepoint allows efficient full recording of DAMOS-based filtered monitoring results. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: SeongJae Park <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]> [tracing] Cc: Jonathan Corbet <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  mm: migrate: convert migrate_misplaced_page() to migrate_misplaced_folio()  (Kefeng Wang, 1 file, -2/+2)
At present, NUMA balancing only supports base pages and PMD-mapped THP, but we will expand it to support migrating large folios / PTE-mapped THP in the future. It is better to make migrate_misplaced_page() take a folio instead of a page, and rename it to migrate_misplaced_folio(). This is a preparation, and it also removes several compound_head() calls. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Kefeng Wang <[email protected]> Reviewed-by: Zi Yan <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Kravetz <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  mm/rmap: pass folio to hugepage_add_anon_rmap()  (David Hildenbrand, 1 file, -1/+1)
Let's pass a folio; we are always mapping the entire thing. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: Muchun Song <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-10-04  mm: shrinker: make global slab shrink lockless  (Qi Zheng, 1 file, -0/+24)
The shrinker_rwsem is a global read-write lock in shrinkers subsystem, which protects most operations such as slab shrink, registration and unregistration of shrinkers, etc. This can easily cause problems in the following cases. 1) When the memory pressure is high and there are many filesystems mounted or unmounted at the same time, slab shrink will be affected (down_read_trylock() failed). Such as the real workload mentioned by Kirill Tkhai: ``` One of the real workloads from my experience is start of an overcommitted node containing many starting containers after node crash (or many resuming containers after reboot for kernel update). In these cases memory pressure is huge, and the node goes round in long reclaim. ``` 2) If a shrinker is blocked (such as the case mentioned in [1]) and a writer comes in (such as mount a fs), then this writer will be blocked and cause all subsequent shrinker-related operations to be blocked. Even if there is no competitor when shrinking slab, there may still be a problem. The down_read_trylock() may become a perf hotspot with frequent calls to shrink_slab(). Because of the poor multicore scalability of atomic operations, this can lead to a significant drop in IPC (instructions per cycle). We used to implement the lockless slab shrink with SRCU [2], but then kernel test robot reported -88.8% regression in stress-ng.ramfs.ops_per_sec test case [3], so we reverted it [4]. This commit uses the refcount+RCU method [5] proposed by Dave Chinner to re-implement the lockless global slab shrink. The memcg slab shrink is handled in the subsequent patch. For now, all shrinker instances are converted to dynamically allocated and will be freed by call_rcu(). So we can use rcu_read_{lock,unlock}() to ensure that the shrinker instance is valid. And the shrinker instance will not be run again after unregistration. So the structure that records the pointer of shrinker instance can be safely freed without waiting for the RCU read-side critical section. In this way, while we implement the lockless slab shrink, we don't need to be blocked in unregister_shrinker(). The following are the test results: stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 & 1) Before applying this patchset: setting to a 60 second run per stressor dispatching hogs: 9 ramfs stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s (secs) (secs) (secs) (real time) (usr+sys time) ramfs 473062 60.00 8.00 279.13 7884.12 1647.59 for a 60.01s run time: 1440.34s available CPU time 7.99s user time ( 0.55%) 279.13s system time ( 19.38%) 287.12s total time ( 19.93%) load average: 7.12 2.99 1.15 successful run completed in 60.01s (1 min, 0.01 secs) 2) After applying this patchset: setting to a 60 second run per stressor dispatching hogs: 9 ramfs stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s (secs) (secs) (secs) (real time) (usr+sys time) ramfs 477165 60.00 8.13 281.34 7952.55 1648.40 for a 60.01s run time: 1440.33s available CPU time 8.12s user time ( 0.56%) 281.34s system time ( 19.53%) 289.46s total time ( 20.10%) load average: 6.98 3.03 1.19 successful run completed in 60.01s (1 min, 0.01 secs) We can see that the ops/s has hardly changed. [1]. https://lore.kernel.org/lkml/[email protected]/ [2]. https://lore.kernel.org/lkml/[email protected]/ [3]. https://lore.kernel.org/lkml/[email protected]/ [4]. https://lore.kernel.org/all/[email protected]/ [5]. 
https://lore.kernel.org/lkml/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Qi Zheng <[email protected]> Cc: Abhinav Kumar <[email protected]> Cc: Alasdair Kergon <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alyssa Rosenzweig <[email protected]> Cc: Andreas Dilger <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Bob Peterson <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Carlos Llamas <[email protected]> Cc: Chandan Babu R <[email protected]> Cc: Chao Yu <[email protected]> Cc: Chris Mason <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christian Koenig <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Coly Li <[email protected]> Cc: Dai Ngo <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: "Darrick J. Wong" <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Airlie <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Sterba <[email protected]> Cc: Dmitry Baryshkov <[email protected]> Cc: Gao Xiang <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Huang Rui <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jani Nikula <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jason Wang <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jeffle Xu <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Marijn Suijten <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Mike Snitzer <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Muchun Song <[email protected]> Cc: Muchun Song <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Neil Brown <[email protected]> Cc: Oleksandr Tyshchenko <[email protected]> Cc: Olga Kornievskaia <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rob Clark <[email protected]> Cc: Rob Herring <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Sean Paul <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Song Liu <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Steven Price <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tomeu Vizoso <[email protected]> Cc: Tom Talpey <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Xuan Zhuo <[email protected]> Cc: Yue Hu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
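A hedged sketch of the refcount+RCU walk described in this entry; shrinker_try_get() and shrinker_put() stand in for whatever helpers the series actually adds, and the do_shrink_slab() call site is simplified:

    rcu_read_lock();
    list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
        if (!shrinker_try_get(shrinker))        /* skip shrinkers being torn down */
            continue;
        /* The refcount pins the shrinker, so the RCU section can be left. */
        rcu_read_unlock();

        do_shrink_slab(&sc, shrinker, priority);

        rcu_read_lock();
        shrinker_put(shrinker);                 /* final put frees via call_rcu() */
    }
    rcu_read_unlock();

The point of the design is that unregister_shrinker() no longer has to block behind a global rwsem: the structure holding the shrinker pointer is freed only after an RCU grace period.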
2023-10-04  mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred}  (Qi Zheng, 2 files, -11/+18)
Currently, we maintain two linear arrays per node per memcg, which are shrinker_info::map and shrinker_info::nr_deferred. And we need to resize them when the shrinker_nr_max is exceeded, that is, allocate a new array, and then copy the old array to the new array, and finally free the old array by RCU. For shrinker_info::map, we do set_bit() under the RCU lock, so we may set the value into the old map which is about to be freed. This may cause the value set to be lost. The current solution is not to copy the old map when resizing, but to set all the corresponding bits in the new map to 1. This solves the data loss problem, but bring the overhead of more pointless loops while doing memcg slab shrink. For shrinker_info::nr_deferred, we will only modify it under the read lock of shrinker_rwsem, so it will not run concurrently with the resizing. But after we make memcg slab shrink lockless, there will be the same data loss problem as shrinker_info::map, and we can't work around it like the map. For such resizable arrays, the most straightforward idea is to change it to xarray, like we did for list_lru [1]. We need to do xa_store() in the list_lru_add()-->set_shrinker_bit(), but this will cause memory allocation, and the list_lru_add() doesn't accept failure. A possible solution is to pre-allocate, but the location of pre-allocation is not well determined (such as deferred_split_shrinker case). Therefore, this commit chooses to introduce the following secondary array for shrinker_info::{map, nr_deferred}: +---------------+--------+--------+-----+ | shrinker_info | unit 0 | unit 1 | ... | (secondary array) +---------------+--------+--------+-----+ | v +---------------+-----+ | nr_deferred[] | map | (leaf array) +---------------+-----+ (shrinker_info_unit) The leaf array is never freed unless the memcg is destroyed. The secondary array will be resized every time the shrinker id exceeds shrinker_nr_max. So the shrinker_info_unit can be indexed from both the old and the new shrinker_info->unit[x]. Then even if we get the old secondary array under the RCU lock, the found map and nr_deferred are also true, so the updated nr_deferred and map will not be lost. [1]. https://lore.kernel.org/all/[email protected]/ [[email protected]: unlock the &shrinker_rwsem before the call to free_shrinker_info()] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Qi Zheng <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Abhinav Kumar <[email protected]> Cc: Alasdair Kergon <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alyssa Rosenzweig <[email protected]> Cc: Andreas Dilger <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Bob Peterson <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Carlos Llamas <[email protected]> Cc: Chandan Babu R <[email protected]> Cc: Chao Yu <[email protected]> Cc: Chris Mason <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christian Koenig <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Coly Li <[email protected]> Cc: Dai Ngo <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: "Darrick J. 
Wong" <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Airlie <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Sterba <[email protected]> Cc: Dmitry Baryshkov <[email protected]> Cc: Gao Xiang <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Huang Rui <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jani Nikula <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jason Wang <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jeffle Xu <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Marijn Suijten <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Mike Snitzer <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Muchun Song <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Neil Brown <[email protected]> Cc: Oleksandr Tyshchenko <[email protected]> Cc: Olga Kornievskaia <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rob Clark <[email protected]> Cc: Rob Herring <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Sean Paul <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Song Liu <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Steven Price <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tomeu Vizoso <[email protected]> Cc: Tom Talpey <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Xuan Zhuo <[email protected]> Cc: Yue Hu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
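A hedged sketch of the two-level layout drawn in the diagram above; the field names and the SHRINKER_UNIT_BITS constant are illustrative, not necessarily what the patch uses:

    struct shrinker_info_unit {                 /* leaf array: never resized */
        atomic_long_t nr_deferred[SHRINKER_UNIT_BITS];
        DECLARE_BITMAP(map, SHRINKER_UNIT_BITS);
    };

    struct shrinker_info {                      /* secondary array: resized via RCU */
        struct rcu_head rcu;
        int map_nr_max;                         /* highest shrinker id + 1 */
        struct shrinker_info_unit *unit[];      /* one pointer per unit */
    };

Because only the pointer array is reallocated on resize while the leaf units stay in place, a reader holding an old secondary array under RCU still sees valid map and nr_deferred storage, which is what prevents the data loss described above.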
2023-10-04  mm: shrinker: remove old APIs  (Qi Zheng, 1 file, -8/+0)
Now no users are using the old APIs, just remove them. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Qi Zheng <[email protected]> Reviewed-by: Muchun Song <[email protected]> Cc: Abhinav Kumar <[email protected]> Cc: Alasdair Kergon <[email protected]> Cc: Alexander Viro <[email protected]> Cc: Alyssa Rosenzweig <[email protected]> Cc: Andreas Dilger <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Cc: Anna Schumaker <[email protected]> Cc: Arnd Bergmann <[email protected]> Cc: Bob Peterson <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Carlos Llamas <[email protected]> Cc: Chandan Babu R <[email protected]> Cc: Chao Yu <[email protected]> Cc: Chris Mason <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Christian Koenig <[email protected]> Cc: Chuck Lever <[email protected]> Cc: Coly Li <[email protected]> Cc: Dai Ngo <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: Daniel Vetter <[email protected]> Cc: "Darrick J. Wong" <[email protected]> Cc: Dave Chinner <[email protected]> Cc: Dave Hansen <[email protected]> Cc: David Airlie <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: David Sterba <[email protected]> Cc: Dmitry Baryshkov <[email protected]> Cc: Gao Xiang <[email protected]> Cc: Greg Kroah-Hartman <[email protected]> Cc: Huang Rui <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jaegeuk Kim <[email protected]> Cc: Jani Nikula <[email protected]> Cc: Jan Kara <[email protected]> Cc: Jason Wang <[email protected]> Cc: Jeff Layton <[email protected]> Cc: Jeffle Xu <[email protected]> Cc: Joel Fernandes (Google) <[email protected]> Cc: Joonas Lahtinen <[email protected]> Cc: Josef Bacik <[email protected]> Cc: Juergen Gross <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Kirill Tkhai <[email protected]> Cc: Marijn Suijten <[email protected]> Cc: "Michael S. Tsirkin" <[email protected]> Cc: Mike Snitzer <[email protected]> Cc: Minchan Kim <[email protected]> Cc: Muchun Song <[email protected]> Cc: Nadav Amit <[email protected]> Cc: Neil Brown <[email protected]> Cc: Oleksandr Tyshchenko <[email protected]> Cc: Olga Kornievskaia <[email protected]> Cc: Paul E. McKenney <[email protected]> Cc: Richard Weinberger <[email protected]> Cc: Rob Clark <[email protected]> Cc: Rob Herring <[email protected]> Cc: Rodrigo Vivi <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Sean Paul <[email protected]> Cc: Sergey Senozhatsky <[email protected]> Cc: Song Liu <[email protected]> Cc: Stefano Stabellini <[email protected]> Cc: Steven Price <[email protected]> Cc: "Theodore Ts'o" <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Tomeu Vizoso <[email protected]> Cc: Tom Talpey <[email protected]> Cc: Trond Myklebust <[email protected]> Cc: Tvrtko Ursulin <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Xuan Zhuo <[email protected]> Cc: Yue Hu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>