aboutsummaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
2023-12-12RDMA/mlx5: Support handling of SW encap ICM areaShun Hao1-0/+1
New type for this ICM area, now the user can allocate/deallocate the new type of SW encap ICM memory, to store the encap header data which are managed by SW. Signed-off-by: Shun Hao <[email protected]> Link: https://lore.kernel.org/r/546fe43fc700240709e30acf7713ec6834d652bd.1701871118.git.leon@kernel.org Signed-off-by: Leon Romanovsky <[email protected]>
2023-12-12net/mlx5: Introduce indirect-sw-encap ICM propertiesShun Hao1-2/+7
Add new fields for device memory capabilities, in order to support creation of new ICM memory type of SW encap. Signed-off-by: Shun Hao <[email protected]> Link: https://lore.kernel.org/r/107cca7dd6a932a1704abf6ebd1b801105546a8e.1701871118.git.leon@kernel.org Signed-off-by: Leon Romanovsky <[email protected]>
2023-12-11bpf: use bitfields for simple per-subprog bool flagsAndrii Nakryiko1-6/+6
We have a bunch of bool flags for each subprog. Instead of wasting bytes for them, use bitfields instead. Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-12-11bpf: tidy up exception callback management a bitAndrii Nakryiko1-1/+1
Use the fact that we are passing subprog index around and have a corresponding struct bpf_subprog_info in bpf_verifier_env for each subprogram. We don't need to separately pass around a flag whether subprog is exception callback or not, each relevant verifier function can determine this using provided subprog index if we maintain bpf_subprog_info properly. Also move out exception callback-specific logic from btf_prepare_func_args(), keeping it generic. We can enforce all these restriction right before exception callback verification pass. We add out parameter, arg_cnt, for now, but this will be unnecessary with subsequent refactoring and will be removed. Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: Eduard Zingerman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-12-11rtnl: add helper to send if skb is not nullPedro Tammela1-0/+7
This is a convenience helper for routines handling conditional rtnl events, that is code that might send a notification depending on rtnl_has_listeners/rtnl_notify_needed. Instead of: if (skb) rtnetlink_send(...) Use: rtnetlink_maybe_send(...) Reviewed-by: Jiri Pirko <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: Pedro Tammela <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-12-11rtnl: add helper to check if a notification is neededVictor Nogueira1-0/+15
Building on the rtnl_has_listeners helper, add the rtnl_notify_needed helper to check if we can bail out early in the notification routines. Reviewed-by: Jiri Pirko <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: Victor Nogueira <[email protected]> Signed-off-by: Pedro Tammela <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-12-11rtnl: add helper to check if rtnl group has listenersJamal Hadi Salim1-0/+7
As of today, rtnl code creates a new skb and unconditionally fills and broadcasts it to the relevant group. For most operations this is okay and doesn't waste resources in general. When operations are done without the rtnl_lock, as in tc-flower, such skb allocation, message fill and no-op broadcasting can happen in all cores of the system, which contributes to system pressure and wastes precious cpu cycles when no one will receive the built message. Introduce this helper so rtnetlink operations can simply check if someone is listening and then proceed if necessary. Reviewed-by: Jiri Pirko <[email protected]> Reviewed-by: Simon Horman <[email protected]> Signed-off-by: Jamal Hadi Salim <[email protected]> Signed-off-by: Victor Nogueira <[email protected]> Signed-off-by: Pedro Tammela <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-12-12Backmerge tag 'v6.7-rc5' into drm-nextDave Airlie23-45/+94
Linux 6.7-rc5 Alex requested this for some amdkfd work relying on the symbols exports. Signed-off-by: Dave Airlie <[email protected]>
2023-12-12Merge tag 'gpio-remove-gpiochip_is_requested-for-v6.8-rc1' of ↵Linus Walleij1-9/+30
git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux into devel gpio: remove gpiochip_is_requested() - provide a safer alternative to gpiochip_is_requested() - convert all existing users - remove gpiochip_is_requested()
2023-12-12rcu: Remove unused macros from rcupdate.hPedro Falcato1-3/+0
ulong2long, USHORT_CMP_GE and USHORT_CMP_LT are redundant and have been unused for quite a few releases. Signed-off-by: Pedro Falcato <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Neeraj Upadhyay (AMD) <[email protected]>
2023-12-12rcu: Restrict access to RCU CPU stall notifiersPaul E. McKenney1-3/+3
Although the RCU CPU stall notifiers can be useful for dumping state when tracking down delicate forward-progress bugs where NUMA effects cause cache lines to be delivered to a given CPU regularly, but always in a state that prevents that CPU from making forward progress. These bugs can be detected by the RCU CPU stall-warning mechanism, but in some cases, the stall-warnings printk()s disrupt the forward-progress bug before any useful state can be obtained. Unfortunately, the notifier mechanism added by commit 5b404fdabacf ("rcu: Add RCU CPU stall notifier") can make matters worse if used at all carelessly. For example, if the stall warning was caused by a lock not being released, then any attempt to acquire that lock in the notifier will hang. This will prevent not only the notifier from producing any useful output, but it will also prevent the stall-warning message from ever appearing. This commit therefore hides this new RCU CPU stall notifier mechanism under a new RCU_CPU_STALL_NOTIFIER Kconfig option that depends on both DEBUG_KERNEL and RCU_EXPERT. In addition, the rcupdate.rcu_cpu_stall_notifiers=1 kernel boot parameter must also be specified. The RCU_CPU_STALL_NOTIFIER Kconfig option's help text contains a warning and explains the dangers of careless use, recommending lockless notifier code. In addition, a WARN() is triggered each time that an attempt is made to register a stall-warning notifier in kernels built with CONFIG_RCU_CPU_STALL_NOTIFIER=y. This combination of measures will keep use of this mechanism confined to debug kernels and away from routine deployments. [ paulmck: Apply Dan Carpenter feedback. ] Fixes: 5b404fdabacf ("rcu: Add RCU CPU stall notifier") Reported-by: Linus Torvalds <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Reviewed-by: Joel Fernandes (Google) <[email protected]> Signed-off-by: Neeraj Upadhyay (AMD) <[email protected]>
2023-12-11thermal: core: Make thermal_zone_device_unregister() return after freeing ↵Rafael J. Wysocki1-0/+2
the zone Make thermal_zone_device_unregister() wait until all of the references to the given thermal zone object have been dropped and free it before returning. This guarantees that when thermal_zone_device_unregister() returns, there is no leftover activity regarding the thermal zone in question which is required by some of its callers (for instance, modular driver code that wants to know when it is safe to let the module go away). Subsequently, this will allow some confusing device_is_registered() checks to be dropped from the thermal sysfs and core code. Signed-off-by: Rafael J. Wysocki <[email protected]> Reviewed-and-tested-by: Lukasz Luba <[email protected]> Acked-by: Daniel Lezcano <[email protected]>
2023-12-11iio: core: introduce trough info element for minimum valuesJavier Carrasco1-0/+1
The IIO_CHAN_INFO_PEAK info element is used for maximum values and currently there is no equivalent for minimum values. Instead of overloading the existing peak info element, a new info element can be added. In principle there is no need to add a _TROUGH_SCALE element as the scale will be the same as the one required for INFO_PEAK, which in turn is sometimes omitted if a single scale for peaks and raw values is required. Add an IIO_CHAN_INFO_TROUGH info element for minimum values. Signed-off-by: Javier Carrasco <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jonathan Cameron <[email protected]>
2023-12-11add statmount(2) syscallMiklos Szeredi1-0/+5
Add a way to query attributes of a single mount instead of having to parse the complete /proc/$PID/mountinfo, which might be huge. Lookup the mount the new 64bit mount ID. If a mount needs to be queried based on path, then statx(2) can be used to first query the mount ID belonging to the path. Design is based on a suggestion by Linus: "So I'd suggest something that is very much like "statfsat()", which gets a buffer and a length, and returns an extended "struct statfs" *AND* just a string description at the end." The interface closely mimics that of statx. Handle ASCII attributes by appending after the end of the structure (as per above suggestion). Pointers to strings are stored in u64 members to make the structure the same regardless of pointer size. Strings are nul terminated. Link: https://lore.kernel.org/all/CAHk-=wh5YifP7hzKSbwJj94+DZ2czjrZsczy6GBimiogZws=rg@mail.gmail.com/ Signed-off-by: Miklos Szeredi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Ian Kent <[email protected]> [Christian Brauner <[email protected]>: various minor changes] Signed-off-by: Christian Brauner <[email protected]>
2023-12-11PCI/ASPM: Add pci_enable_link_state_locked()Johan Hovold1-0/+3
Add pci_enable_link_state_locked() for enabling link states that can be used in contexts where a pci_bus_sem read lock is already held (e.g. from pci_walk_bus()). This helper will be used to fix a couple of potential deadlocks where the current helper is called with the lock already held, hence the CC stable tag. Fixes: f492edb40b54 ("PCI: vmd: Add quirk to configure PCIe ASPM and LTR") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Johan Hovold <[email protected]> [bhelgaas: include helper name in subject, commit log] Signed-off-by: Bjorn Helgaas <[email protected]> Reviewed-by: Manivannan Sadhasivam <[email protected]> Cc: <[email protected]> # 6.3 Cc: Michael Bottini <[email protected]> Cc: David E. Box <[email protected]>
2023-12-11net, xdp: Allow metadata > 32Aleksander Lobakin1-5/+8
32 bytes may be not enough for some custom metadata. Relax the restriction, allow metadata larger than 32 bytes and make __skb_metadata_differs() work with bigger lengths. Now size of metadata is only limited by the fact it is stored as u8 in skb_shared_info, so maximum possible value is 255. Size still has to be aligned to 4, so the actual upper limit becomes 252. Most driver implementations will offer less, none can offer more. Other important conditions, such as having enough space for xdp_frame building, are already checked in bpf_xdp_adjust_meta(). Signed-off-by: Aleksander Lobakin <[email protected]> Signed-off-by: Larysa Zaremba <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://lore.kernel.org/bpf/[email protected] Link: https://lore.kernel.org/bpf/[email protected]
2023-12-11quota: convert dquot_claim_space_nodirty() to return voidChao Yu1-10/+5
dquot_claim_space_nodirty() always return zero, let's convert it to return void, then, its caller can get rid of handling failure case. Signed-off-by: Chao Yu <[email protected]> Signed-off-by: Jan Kara <[email protected]> Message-Id: <[email protected]>
2023-12-11soc: mediatek: Support MT8188 VDOSYS1 Padding in mtk-mmsysHsiao Chien Sung1-0/+8
- Add Padding components - Add Mutex module definitions for Padding Reviewed-by: AngeloGioacchino Del Regno <[email protected]> Signed-off-by: Hsiao Chien Sung <[email protected]> Signed-off-by: AngeloGioacchino Del Regno <[email protected]>
2023-12-11platform/x86/amd: Add support for AMD ACPI based Wifi band RFI mitigation ↵Ma Jun1-0/+91
feature Due to electrical and mechanical constraints in certain platform designs there may be likely interference of relatively high-powered harmonics of the (G-)DDR memory clocks with local radio module frequency bands used by Wifi 6/6e/7. To mitigate this, AMD has introduced a mechanism that devices can use to notify active use of particular frequencies so that other devices can make relative internal adjustments as necessary to avoid this resonance. Co-developed-by: Evan Quan <[email protected]> Signed-off-by: Evan Quan <[email protected]> Signed-off-by: Ma Jun <[email protected]> Reviewed-by: Mario Limonciello <[email protected]> Reviewed-by: Hans de Goede <[email protected]> Signed-off-by: Hans de Goede <[email protected]>
2023-12-11platform/x86: wmi: Remove chardev interfaceArmin Wolf1-8/+0
The design of the WMI chardev interface is broken: - it assumes that WMI drivers are not instantiated twice - it offers next to no abstractions, the WMI driver gets a raw byte buffer - it is only used by a single driver, something which is unlikely to change Since the only user (dell-smbios-wmi) has been migrated to his own ioctl interface, remove it. Reviewed-by: Ilpo Järvinen <[email protected]> Signed-off-by: Armin Wolf <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Hans de Goede <[email protected]> Signed-off-by: Hans de Goede <[email protected]>
2023-12-11Merge tag 'platform-drivers-x86-v6.7-3' into pdx86/for-nextHans de Goede1-0/+3
Back merge pdx86 fixes into pdx86/for-next for further WMI work depending on some of the fixes. platform-drivers-x86 for v6.7-3 Highlights: - asus-wmi: Solve i8042 filter resource handling, input, and suspend issues - wmi: Skip zero instance WMI blocks to avoid issues with some laptops - mlxbf-bootctl: Differentiate dev/production keys - platform/surface: Correct serdev related return value to avoid leaking errno into userspace - Error checking fixes The following is an automated shortlog grouped by driver: asus-wmi: - Change q500a_i8042_filter() into a generic i8042-filter - disable USB0 hub on ROG Ally before suspend - Filter Volume key presses if also reported via atkbd - Move i8042 filter install to shared asus-wmi code mellanox: - Add null pointer checks for devm_kasprintf() - Check devm_hwmon_device_register_with_groups() return value mlxbf-bootctl: - correctly identify secure boot with development keys surface: aggregator: - fix recv_buf() return value wmi: - Skip blocks with zero instances
2023-12-11efivarfs: automatically update super block flagMasahisa Kojima1-0/+8
efivar operation is updated when the tee_stmm_efi module is probed. tee_stmm_efi module supports SetVariable runtime service, but user needs to manually remount the efivarfs as RW to enable the write access if the previous efivar operation does not support SetVariable and efivarfs is mounted as read-only. This commit notifies the update of efivar operation to efivarfs subsystem, then drops SB_RDONLY flag if the efivar operation supports SetVariable. Signed-off-by: Masahisa Kojima <[email protected]> [ardb: use per-superblock instance of the notifier block] Signed-off-by: Ard Biesheuvel <[email protected]>
2023-12-11efi: Add EFI_ACCESS_DENIED status codeMasahisa Kojima1-0/+1
This commit adds the EFI_ACCESS_DENIED status code. Acked-by: Sumit Garg <[email protected]> Co-developed-by: Ilias Apalodimas <[email protected]> Signed-off-by: Ilias Apalodimas <[email protected]> Signed-off-by: Masahisa Kojima <[email protected]> Signed-off-by: Ard Biesheuvel <[email protected]>
2023-12-11efi: expose efivar generic ops register functionMasahisa Kojima1-0/+3
This is a preparation for supporting efivar operations provided by other than efi subsystem. Both register and unregister functions are exposed so that non-efi subsystem can revert the efi generic operation. Acked-by: Sumit Garg <[email protected]> Co-developed-by: Ilias Apalodimas <[email protected]> Signed-off-by: Ilias Apalodimas <[email protected]> Signed-off-by: Masahisa Kojima <[email protected]> Signed-off-by: Ard Biesheuvel <[email protected]>
2023-12-11platform/x86/intel/tpmi: Move TPMI ID definitionSrinivas Pandruvada1-0/+13
Move TPMI ID definitions to common include file. In this way other feature drivers don't have to redefine. Signed-off-by: Srinivas Pandruvada <[email protected]> Reviewed-by: Ilpo Järvinen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Hans de Goede <[email protected]>
2023-12-11platform/x86/intel/tpmi: Modify external interface to get read/write stateSrinivas Pandruvada1-3/+2
Modify the external interface tpmi_get_feature_status() to get read and write blocked instead of locked and disabled. Since auxiliary device is not created when disabled, no use of returning disabled state. Also locked state is not useful as feature driver can't use locked state in a meaningful way. Using read and write state, feature driver can decide which operations to restrict for that feature. Signed-off-by: Srinivas Pandruvada <[email protected]> Reviewed-by: Ilpo Järvinen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Hans de Goede <[email protected]>
2023-12-11clk: x86: lpss-atom: Drop unneeded 'extern' in the headerAndy Shevchenko1-1/+1
'extern' for the functions is not needed, drop it. Signed-off-by: Andy Shevchenko <[email protected]> Reviewed-by: Ilpo Järvinen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Hans de Goede <[email protected]> Signed-off-by: Hans de Goede <[email protected]>
2023-12-11Merge 6.7-rc5 into tty-nextGreg Kroah-Hartman29-63/+156
We need the serial fixes in here as well to build off of. Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-12-11Merge 6.7-rc5 into usb-nextGreg Kroah-Hartman23-45/+94
We need the USB fixes in here as well to build off of. Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-12-11Merge 6.7-rc5 into char-misc-nextGreg Kroah-Hartman23-45/+94
We need the char/misc fixes in here as well for testing and to build off of. Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-12-10resource: add walk_system_ram_res_rev()Baoquan He1-0/+3
This function, being a variant of walk_system_ram_res() introduced in commit 8c86e70acead ("resource: provide new functions to walk through resources"), walks through a list of all the resources of System RAM in reversed order, i.e., from higher to lower. It will be used in kexec_file code to load kernel, initrd etc when preparing kexec reboot. Link: https://lkml.kernel.org/r/ZVTA6z/06cLnWKUz@MiWiFi-R3L-srv Signed-off-by: AKASHI Takahiro <[email protected]> Signed-off-by: Baoquan He <[email protected]> Cc: Eric Biederman <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10kernel/signal.c: simplify force_sig_info_to_task(), kill ↵Oleg Nesterov1-1/+0
recalc_sigpending_and_wake() The purpose of recalc_sigpending_and_wake() is not clear, it looks "obviously unneeded" because we are going to send the signal which can't be blocked or ignored. Add the comment to explain why we can't rely on send_signal_locked() and make this logic more simple/explicit. recalc_sigpending_and_wake() has no other users, it can die. In fact I think we don't even need signal_wake_up(), the target task must be either current or a TASK_TRACED child, otherwise the usage of siglock is not safe. But this needs another change. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Cc: Eric Biederman <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10arch: remove ARCH_TASK_STRUCT_ON_STACKHeiko Carstens2-9/+0
IA-64 was the only architecture which selected ARCH_TASK_STRUCT_ON_STACK. IA-64 was removed with commit cf8e8658100d ("arch: Remove Itanium (IA-64) architecture"). Therefore remove support for ARCH_TASK_STRUCT_ON_STACK as well. Note: this also reveals a potential bug in powerpc code, which makes use of __init_task_data without selecting ARCH_TASK_STRUCT_ON_STACK which makes __init_task_data a no-op. This is broken since commit d11ed3ab3166 ("Expand INIT_TASK() in init/init_task.c and remove") from 2018 and needs to be addressed separately. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Heiko Carstens <[email protected]> Reviewed-by: Arnd Bergmann <[email protected]> Cc: Michael Ellerman <[email protected]> Cc: Nicholas Piggin <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10introduce for_other_threads(p, t)Oleg Nesterov1-0/+3
Cosmetic, but imho it makes the usage look more clear and simple, the new helper doesn't require to initialize "t". After this change while_each_thread() has only 3 users, and it is only used in the do/while loops. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Oleg Nesterov <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Cc: "Eric W. Biederman" <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10mm: list_lru: Update kernel documentation to follow the requirementsAndy Shevchenko1-17/+19
kernel-doc is not happy about documentation in list_lru.h: list_lru.h:90: warning: Function parameter or member 'lru' not described in 'list_lru_add' list_lru.h:90: warning: Excess function parameter 'list_lru' description in 'list_lru_add' list_lru.h:90: warning: No description found for return value of 'list_lru_add' list_lru.h:103: warning: Function parameter or member 'lru' not described in 'list_lru_del' list_lru.h:103: warning: Excess function parameter 'list_lru' description in 'list_lru_del' list_lru.h:103: warning: No description found for return value of 'list_lru_del' list_lru.h:116: warning: No description found for return value of 'list_lru_count_one' list_lru.h:168: warning: No description found for return value of 'list_lru_walk_one' list_lru.h:185: warning: No description found for return value of 'list_lru_walk_one_irq' Fix the documentation accordingly. While at it, fix the references to the parameters in functions inside the long descriptions, on which the above script is not complaining (yet?). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Andy Shevchenko <[email protected]> Cc: Rasmus Villemoes <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10pgtable: rename ptdesc _refcount field to __page_refcountAlexander Gordeev1-3/+3
Rename ptdesc _refcount field to __page_refcount similar to the other unused page fields. Link: https://lkml.kernel.org/r/982bdc652ba79a606c3d01c905766e7e076b3315.1700594815.git.agordeev@linux.ibm.com Signed-off-by: Alexander Gordeev <[email protected]> Suggested-by: Vishal Moola <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Heiko Carstens <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10pgtable: fix s390 ptdesc field commentsAlexander Gordeev1-2/+2
Patch series "minor ptdesc updates", v3. This patch (of 2): Since commit d08d4e7cd6bf ("s390/mm: use full 4KB page for 2KB PTE") there is no fragmented page tracking on s390. Fix the corresponding comments. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/2eead241f3a45bed26c7911cf66bded1e35670b8.1700594815.git.agordeev@linux.ibm.com Signed-off-by: Alexander Gordeev <[email protected]> Suggested-by: Heiko Carstens <[email protected]> Cc: Gerald Schaefer <[email protected]> Cc: Vishal Moola (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10mm: use vmem_altmap code without CONFIG_ZONE_DEVICESumanth Korikkar2-12/+26
vmem_altmap_free() and vmem_altmap_offset() could be utlized without CONFIG_ZONE_DEVICE enabled. For example, mm/memory_hotplug.c:__add_pages() relies on that. The altmap is no longer restricted to ZONE_DEVICE handling, but instead depends on CONFIG_SPARSEMEM_VMEMMAP. When CONFIG_SPARSEMEM_VMEMMAP is disabled, these functions are defined as inline stubs, ensuring compatibility with configurations that do not use sparsemem vmemmap. Without it, lkp reported the following: ld: arch/x86/mm/init_64.o: in function `remove_pagetable': init_64.c:(.meminit.text+0xfc7): undefined reference to `vmem_altmap_free' Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Sumanth Korikkar <[email protected]> Reported-by: kernel test robot <[email protected]> Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/ Reviewed-by: Gerald Schaefer <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Alexander Gordeev <[email protected]> Cc: Aneesh Kumar K.V <[email protected]> Cc: Anshuman Khandual <[email protected]> Cc: Heiko Carstens <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Vasily Gorbik <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10lib/stackdepot: allow users to evict stack tracesAndrey Konovalov1-0/+14
Add stack_depot_put, a function that decrements the reference counter on a stack record and removes it from the stack depot once the counter reaches 0. Internally, when removing a stack record, the function unlinks it from the hash table bucket and returns to the freelist. With this change, the users of stack depot can call stack_depot_put when keeping a stack trace in the stack depot is not needed anymore. This allows avoiding polluting the stack depot with irrelevant stack traces and thus have more space to store the relevant ones before the stack depot reaches its capacity. Link: https://lkml.kernel.org/r/1d1ad5692ee43d4fc2b3fd9d221331d30b36123f.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Marco Elver <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10lib/stackdepot: add refcount for recordsAndrey Konovalov1-3/+10
Add a reference counter for how many times a stack records has been added to stack depot. Add a new STACK_DEPOT_FLAG_GET flag to stack_depot_save_flags that instructs the stack depot to increment the refcount. Do not yet decrement the refcount; this is implemented in one of the following patches. Do not yet enable any users to use the flag to avoid overflowing the refcount. This is preparatory patch for implementing the eviction of stack records from the stack depot. Link: https://lkml.kernel.org/r/a3fc14a2359d019d2a008d4ff8b46a665371ffee.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Reviewed-by: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Marco Elver <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10lib/stackdepot, kasan: add flags to __stack_depot_save and renameAndrey Konovalov1-11/+25
Change the bool can_alloc argument of __stack_depot_save to a u32 argument that accepts a set of flags. The following patch will add another flag to stack_depot_save_flags besides the existing STACK_DEPOT_FLAG_CAN_ALLOC. Also rename the function to stack_depot_save_flags, as __stack_depot_save is a cryptic name, Link: https://lkml.kernel.org/r/645fa15239621eebbd3a10331e5864b718839512.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <[email protected]> Reviewed-by: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Evgenii Stepanov <[email protected]> Cc: Marco Elver <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Vlastimil Babka <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10fs: convert error_remove_page to error_remove_folioMatthew Wilcox (Oracle)2-2/+3
There were already assertions that we were not passing a tail page to error_remove_page(), so make the compiler enforce that by converting everything to pass and use a folio. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: Naoya Horiguchi <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10gfp: include __GFP_NOWARN in GFP_NOWAITMatthew Wilcox (Oracle)1-2/+3
GFP_NOWAIT callers are always prepared for their allocations to fail because they fail so frequently. Forcing the callers to remember to add __GFP_NOWARN is just annoying and leads to an endless stream of patches for the places where we forgot to add it. We can now remove __GFP_NOWARN from all the callers which specify GFP_NOWAIT, but I'd rather wait a cycle and send patches to each maintainer instead of creating a big pile of merge conflicts. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10mm: return void from folio_start_writeback() and related functionsMatthew Wilcox (Oracle)1-2/+2
Nobody now checks the return value from any of these functions, so add an assertion at the beginning of the function and return void. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Josef Bacik <[email protected]> Cc: David Howells <[email protected]> Cc: Steve French <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10mm: remove test_set_page_writeback()Matthew Wilcox (Oracle)1-5/+0
Patch series "Make folio_start_writeback return void". Most of the folio flag-setting functions return void. folio_start_writeback is gratuitously different; the only two filesystems that do anything with the return value emit debug messages if it's already set, and we can (and should) do that internally without bothering the filesystem to do it. This patch (of 4): There are no more callers of this wrapper. Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Cc: David Howells <[email protected]> Cc: Steve French <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10mm: add folio_fill_tail() and use it in iomapMatthew Wilcox (Oracle)1-0/+38
The iomap code was limited to PAGE_SIZE bytes; generalise it to cover an arbitrary-sized folio, and move it to be a common helper. [[email protected]: fix folio_fill_tail(), per Andreas Gruenbacher] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Andreas Gruenbacher <[email protected]> Cc: Andreas Dilger <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: Theodore Ts'o <[email protected]> Cc: Andreas Gruenbacher <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10mm: add folio_zero_tail() and use it in ext4Matthew Wilcox (Oracle)1-0/+38
Patch series "Add folio_zero_tail() and folio_fill_tail()". I'm trying to make it easier for filesystems with tailpacking / stuffing / inline data to use folios. The primary function here is folio_fill_tail(). You give it a pointer to memory where the data currently is, and it takes care of copying it into the folio at that offset. That works for gfs2 & iomap. Then There's Ext4. Rather than gin up some kind of specialist "Here's a two pointers to two blocks of memory" routine, just let it do its current thing, and let it call folio_zero_tail(), which is also called by folio_fill_tail(). Other filesystems can be converted later; these ones seemed like good examples as they're already partly or completely converted to folios. This patch (of 3): Instead of unmapping the folio after copying the data to it, then mapping it again to zero the tail, provide folio_zero_tail() to zero the tail of an already-mapped folio. [[email protected]: fix kerneldoc argument ordering] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Andreas Gruenbacher <[email protected]> Cc: Darrick J. Wong <[email protected]> Cc: Theodore Ts'o <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10NUMA: optimize detection of memory with no node id assigned by firmwareLiam Ni1-0/+1
Sanity check that makes sure the nodes cover all memory loops over numa_meminfo to count the pages that have node id assigned by the firmware, then loops again over memblock.memory to find the total amount of memory and in the end checks that the difference between the total memory and memory that covered by nodes is less than some threshold. Worse, the loop over numa_meminfo calls __absent_pages_in_range() that also partially traverses memblock.memory. It's much simpler and more efficient to have a single traversal of memblock.memory that verifies that amount of memory not covered by nodes is less than a threshold. Introduce memblock_validate_numa_coverage() that does exactly that and use it instead of numa_meminfo_cover_memory(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Liam Ni <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Cc: Andy Lutomirski <[email protected]> Cc: Bibo Mao <[email protected]> Cc: Binbin Zhou <[email protected]> Cc: Borislav Petkov <[email protected]> Cc: Dave Hansen <[email protected]> Cc: Feiyang Chen <[email protected]> Cc: "H. Peter Anvin" <[email protected]> Cc: Huacai Chen <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: WANG Xuerui <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10fork: use __mt_dup() to duplicate maple tree in dup_mmap()Peng Zhang1-0/+11
In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then directly replacing the entries of VMAs in the new maple tree can result in better performance. __mt_dup() uses DFS pre-order to duplicate the maple tree, so it is efficient. The average time complexity of __mt_dup() is O(n), where n is the number of VMAs. The proof of the time complexity is provided in the commit log that introduces __mt_dup(). After duplicating the maple tree, each element is traversed and replaced (ignoring the cases of deletion, which are rare). Since it is only a replacement operation for each element, this process is also O(n). Analyzing the exact time complexity of the previous algorithm is challenging because each insertion can involve appending to a node, pushing data to adjacent nodes, or even splitting nodes. The frequency of each action is difficult to calculate. The worst-case scenario for a single insertion is when the tree undergoes splitting at every level. If we consider each insertion as the worst-case scenario, we can determine that the upper bound of the time complexity is O(n*log(n)), although this is a loose upper bound. However, based on the test data, it appears that the actual time complexity is likely to be O(n). As the entire maple tree is duplicated using __mt_dup(), if dup_mmap() fails, there will be a portion of VMAs that have not been duplicated in the maple tree. To handle this, we mark the failure point with XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, stop releasing VMAs that have not been duplicated after this point. There is a "spawn" in byte-unixbench[1], which can be used to test the performance of fork(). I modified it slightly to make it work with different number of VMAs. Below are the test results. The first row shows the number of VMAs. The second and third rows show the number of fork() calls per ten seconds, corresponding to next-20231006 and the this patchset, respectively. The test results were obtained with CPU binding to avoid scheduler load balancing that could cause unstable results. There are still some fluctuations in the test results, but at least they are better than the original performance. 21 121 221 421 821 1621 3221 6421 12821 25621 51221 112100 76261 54227 34035 20195 11112 6017 3161 1606 802 393 114558 83067 65008 45824 28751 16072 8922 4747 2436 1233 599 2.19% 8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42% [1] https://github.com/kdlucas/byte-unixbench/tree/master Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peng Zhang <[email protected]> Suggested-by: Liam R. Howlett <[email protected]> Reviewed-by: Liam R. Howlett <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michael S. Tsirkin <[email protected]> Cc: Mike Christie <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-12-10maple_tree: introduce interfaces __mt_dup() and mtree_dup()Peng Zhang1-0/+3
Introduce interfaces __mt_dup() and mtree_dup(), which are used to duplicate a maple tree. They duplicate a maple tree in Depth-First Search (DFS) pre-order traversal. It uses memcopy() to copy nodes in the source tree and allocate new child nodes in non-leaf nodes. The new node is exactly the same as the source node except for all the addresses stored in it. It will be faster than traversing all elements in the source tree and inserting them one by one into the new tree. The time complexity of these two functions is O(n). The difference between __mt_dup() and mtree_dup() is that mtree_dup() handles locks internally. Analysis of the average time complexity of this algorithm: For simplicity, let's assume that the maximum branching factor of all non-leaf nodes is 16 (in allocation mode, it is 10), and the tree is a full tree. Under the given conditions, if there is a maple tree with n elements, the number of its leaves is n/16. From bottom to top, the number of nodes in each level is 1/16 of the number of nodes in the level below. So the total number of nodes in the entire tree is given by the sum of n/16 + n/16^2 + n/16^3 + ... + 1. This is a geometric series, and it has log(n) terms with base 16. According to the formula for the sum of a geometric series, the sum of this series can be calculated as (n-1)/15. Each node has only one parent node pointer, which can be considered as an edge. In total, there are (n-1)/15-1 edges. This algorithm consists of two operations: 1. Traversing all nodes in DFS order. 2. For each node, making a copy and performing necessary modifications to create a new node. For the first part, DFS traversal will visit each edge twice. Let T(ascend) represent the cost of taking one step downwards, and T(descend) represent the cost of taking one step upwards. And both of them are constants (although mas_ascend() may not be, as it contains a loop, but here we ignore it and treat it as a constant). So the time spent on the first part can be represented as ((n-1)/15-1) * (T(ascend) + T(descend)). For the second part, each node will be copied, and the cost of copying a node is denoted as T(copy_node). For each non-leaf node, it is necessary to reallocate all child nodes, and the cost of this operation is denoted as T(dup_alloc). The behavior behind memory allocation is complex and not specific to the maple tree operation. Here, we assume that the time required for a single allocation is constant. Since the size of a node is fixed, both of these symbols are also constants. We can calculate that the time spent on the second part is ((n-1)/15) * T(copy_node) + ((n-1)/15 - n/16) * T(dup_alloc). Adding both parts together, the total time spent by the algorithm can be represented as: ((n-1)/15) * (T(ascend) + T(descend) + T(copy_node) + T(dup_alloc)) - n/16 * T(dup_alloc) - (T(ascend) + T(descend)) Let C1 = T(ascend) + T(descend) + T(copy_node) + T(dup_alloc) Let C2 = T(dup_alloc) Let C3 = T(ascend) + T(descend) Finally, the expression can be simplified as: ((16 * C1 - 15 * C2) / (15 * 16)) * n - (C1 / 15 + C3). This is a linear function, so the average time complexity is O(n). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Peng Zhang <[email protected]> Suggested-by: Liam R. Howlett <[email protected]> Cc: Christian Brauner <[email protected]> Cc: Jonathan Corbet <[email protected]> Cc: Mateusz Guzik <[email protected]> Cc: Mathieu Desnoyers <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Michael S. Tsirkin <[email protected]> Cc: Mike Christie <[email protected]> Cc: Nicholas Piggin <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Suren Baghdasaryan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>