aboutsummaryrefslogtreecommitdiff
path: root/include/linux
AgeCommit message (Collapse)AuthorFilesLines
2024-04-29Merge tag 'for-netdev' of ↵Jakub Kicinski5-5/+86
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2024-04-29 We've added 147 non-merge commits during the last 32 day(s) which contain a total of 158 files changed, 9400 insertions(+), 2213 deletions(-). The main changes are: 1) Add an internal-only BPF per-CPU instruction for resolving per-CPU memory addresses and implement support in x86 BPF JIT. This allows inlining per-CPU array and hashmap lookups and the bpf_get_smp_processor_id() helper, from Andrii Nakryiko. 2) Add BPF link support for sk_msg and sk_skb programs, from Yonghong Song. 3) Optimize x86 BPF JIT's emit_mov_imm64, and add support for various atomics in bpf_arena which can be JITed as a single x86 instruction, from Alexei Starovoitov. 4) Add support for passing mark with bpf_fib_lookup helper, from Anton Protopopov. 5) Add a new bpf_wq API for deferring events and refactor sleepable bpf_timer code to keep common code where possible, from Benjamin Tissoires. 6) Fix BPF_PROG_TEST_RUN infra with regards to bpf_dummy_struct_ops programs to check when NULL is passed for non-NULLable parameters, from Eduard Zingerman. 7) Harden the BPF verifier's and/or/xor value tracking, from Harishankar Vishwanathan. 8) Introduce crypto kfuncs to make BPF programs able to utilize the kernel crypto subsystem, from Vadim Fedorenko. 9) Various improvements to the BPF instruction set standardization doc, from Dave Thaler. 10) Extend libbpf APIs to partially consume items from the BPF ringbuffer, from Andrea Righi. 11) Bigger batch of BPF selftests refactoring to use common network helpers and to drop duplicate code, from Geliang Tang. 12) Support bpf_tail_call_static() helper for BPF programs with GCC 13, from Jose E. Marchesi. 13) Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF program to have code sections where preemption is disabled, from Kumar Kartikeya Dwivedi. 14) Allow invoking BPF kfuncs from BPF_PROG_TYPE_SYSCALL programs, from David Vernet. 15) Extend the BPF verifier to allow different input maps for a given bpf_for_each_map_elem() helper call in a BPF program, from Philo Lu. 16) Add support for PROBE_MEM32 and bpf_addr_space_cast instructions for riscv64 and arm64 JITs to enable BPF Arena, from Puranjay Mohan. 17) Shut up a false-positive KMSAN splat in interpreter mode by unpoison the stack memory, from Martin KaFai Lau. 18) Improve xsk selftest coverage with new tests on maximum and minimum hardware ring size configurations, from Tushar Vyavahare. 19) Various ReST man pages fixes as well as documentation and bash completion improvements for bpftool, from Rameez Rehman & Quentin Monnet. 20) Fix libbpf with regards to dumping subsequent char arrays, from Quentin Deslandes. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (147 commits) bpf, docs: Clarify PC use in instruction-set.rst bpf_helpers.h: Define bpf_tail_call_static when building with GCC bpf, docs: Add introduction for use in the ISA Internet Draft selftests/bpf: extend BPF_SOCK_OPS_RTT_CB test for srtt and mrtt_us bpf: add mrtt and srtt as BPF_SOCK_OPS_RTT_CB args selftests/bpf: dummy_st_ops should reject 0 for non-nullable params bpf: check bpf_dummy_struct_ops program params for test runs selftests/bpf: do not pass NULL for non-nullable params in dummy_st_ops selftests/bpf: adjust dummy_st_ops_success to detect additional error bpf: mark bpf_dummy_struct_ops.test_1 parameter as nullable selftests/bpf: Add ring_buffer__consume_n test. bpf: Add bpf_guard_preempt() convenience macro selftests: bpf: crypto: add benchmark for crypto functions selftests: bpf: crypto skcipher algo selftests bpf: crypto: add skcipher to bpf crypto bpf: make common crypto API for TC/XDP programs bpf: update the comment for BTF_FIELDS_MAX selftests/bpf: Fix wq test. selftests/bpf: Use make_sockaddr in test_sock_addr selftests/bpf: Use connect_to_addr in test_sock_addr ... ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-04-29iio: backend: add API for interface tuningNuno Sa1-0/+36
This is in preparation for supporting interface tuning in one for the devices using the axi-adc backend. The new added interfaces are all needed for that calibration: * iio_backend_test_pattern_set(); * iio_backend_chan_status(); * iio_backend_iodelay_set(); * iio_backend_data_sample_trigger(). Interface tuning is the process of going through a set of known points (typically by the frontend), change some clk or data delays (or both) and send/receive some known signal (so called test patterns in this change). The receiving end (either frontend or the backend) is responsible for validating the signal and see if it's good or not. The goal for all of this is to come up with ideal delays at the data interface level so we can have a proper, more reliable data transfer. Also note that for some devices we can change the sampling rate (which typically means changing some reference clock) and that can affect the data interface. In that case, it's import to run the tuning algorithm again as the values we had before may no longer be the best (or even valid) ones. Signed-off-by: Nuno Sa <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jonathan Cameron <[email protected]>
2024-04-29iio: backend: change docs paddingNuno Sa1-19/+19
Using tabs and maintaining the start of the docs aligned is a pain and may lead to lot's of unrelated changes when adding new members. Hence, let#s change things now and just have a simple space after the member name. Signed-off-by: Nuno Sa <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jonathan Cameron <[email protected]>
2024-04-29iio: core: Add iio_read_acpi_mount_matrix() helper functionHans de Goede1-0/+13
The ACPI "ROTM" rotation matrix parsing code atm is already duplicated between bmc150-accel-core.c and kxcjk-1013.c and a third user of this is coming. Add an iio_read_acpi_mount_matrix() helper function for this. The 2 existing copies of the code are identical, except that the kxcjk-1013.c has slightly better error logging. To new helper is a 1:1 copy of the kxcjk-1013.c version, the only change is the addition of a "char *acpi_method" parameter since some bmc150 dual-accel setups (360° hinges with 1 accel in kbd/base + 1 in display) declare both accels in a single ACPI device with 2 different method names for the 2 matrices. This new acpi_method parameter is not "const char *" because the pathname parameter to acpi_evaluate_object() is not const. The 2 existing copies of this function will be removed in further patches in this series. Acked-by: Rafael J. Wysocki <[email protected]> Signed-off-by: Hans de Goede <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jonathan Cameron <[email protected]>
2024-04-29Merge branch 'vfio' of ↵Alex Williamson1-0/+31
https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6 into v6.10/vfio/qat-v7
2024-04-29Merge v6.9-rc6 into drm-nextDaniel Vetter8-89/+182
Thomas needs the defio fixes, Maíra needs the vkms fixes and Joonas has some fun with i915-gem conflicts. Signed-off-by: Daniel Vetter <[email protected]>
2024-04-30ASoC: Merge up fixesMark Brown14-30/+133
Some new SOF changes depend on the fixes there.
2024-04-29netfs: Make netfs_io_request::subreq_counter an atomic_tDavid Howells1-1/+1
Make the netfs_io_request::subreq_counter, used to generate values for netfs_io_subrequest::debug_index, into an atomic_t so that it can be called from the retry thread at the same time as the app thread issuing writes. Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: [email protected] cc: [email protected]
2024-04-29mm: Remove the PG_fscache alias for PG_private_2David Howells1-76/+4
Remove the PG_fscache alias for PG_private_2 and use the latter directly. Use of this flag for marking pages undergoing writing to the cache should be considered deprecated and the folios should be marked dirty instead and the write done in ->writepages(). Note that PG_private_2 itself should be considered deprecated and up for future removal by the MM folks too. Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: Matthew Wilcox (Oracle) <[email protected]> cc: Ilya Dryomov <[email protected]> cc: Xiubo Li <[email protected]> cc: Steve French <[email protected]> cc: Paulo Alcantara <[email protected]> cc: Ronnie Sahlberg <[email protected]> cc: Shyam Prasad N <[email protected]> cc: Tom Talpey <[email protected]> cc: Bharath SM <[email protected]> cc: Trond Myklebust <[email protected]> cc: Anna Schumaker <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected]
2024-04-29netfs: Replace PG_fscache by setting folio->private and marking dirtyDavid Howells2-11/+31
When dirty data is being written to the cache, setting/waiting on/clearing the fscache flag is always done in tandem with setting/waiting on/clearing the writeback flag. The netfslib buffered write routines wait on and set both flags and the write request cleanup clears both flags, so the fscache flag is almost superfluous. The reason it isn't superfluous is because the fscache flag is also used to indicate that data just read from the server is being written to the cache. The flag is used to prevent a race involving overlapping direct-I/O writes to the cache. Change this to indicate that a page is in need of being copied to the cache by placing a magic value in folio->private and marking the folios dirty. Then when the writeback code sees a folio marked in this way, it only writes it to the cache and not to the server. If a folio that has this magic value set is modified, the value is just replaced and the folio will then be uplodaded too. With this, PG_fscache is no longer required by the netfslib core, 9p and afs. Ceph and nfs, however, still need to use the old PG_fscache-based tracking. To deal with this, a flag, NETFS_ICTX_USE_PGPRIV2, now has to be set on the flags in the netfs_inode struct for those filesystems. This reenables the use of PG_fscache in that inode. 9p and afs use the netfslib write helpers so get switched over; cifs, for the moment, does page-by-page manual access to the cache, so doesn't use PG_fscache and is unaffected. Signed-off-by: David Howells <[email protected]> Reviewed-by: Jeff Layton <[email protected]> cc: Matthew Wilcox (Oracle) <[email protected]> cc: Eric Van Hensbergen <[email protected]> cc: Latchesar Ionkov <[email protected]> cc: Dominique Martinet <[email protected]> cc: Christian Schoenebeck <[email protected]> cc: Marc Dionne <[email protected]> cc: Ilya Dryomov <[email protected]> cc: Xiubo Li <[email protected]> cc: Steve French <[email protected]> cc: Paulo Alcantara <[email protected]> cc: Ronnie Sahlberg <[email protected]> cc: Shyam Prasad N <[email protected]> cc: Tom Talpey <[email protected]> cc: Bharath SM <[email protected]> cc: Trond Myklebust <[email protected]> cc: Anna Schumaker <[email protected]> cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected] cc: [email protected]
2024-04-29platform/x86/intel/tpmi: Add additional TPMI header fieldsSrinivas Pandruvada1-0/+6
TPMI information header added additional fields in version 2. Some of the reserved fields in version 1 are used to define new fields. Parse new fields and export as part of platform data. These fields include: - PCI segment ID - Partition ID of the package: If a package is represented by more than one PCI device, then partition ID along with cdie_mask, describes the scope. For example to update get/set properties for a compute die, one of the PCI MMIO region is selected from the partition ID. - cdie_mask: Mask of all compute dies in this partition. Signed-off-by: Srinivas Pandruvada <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> Reviewed-by: Zhang Rui <[email protected]> Reviewed-by: Ilpo Järvinen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Hans de Goede <[email protected]> Signed-off-by: Hans de Goede <[email protected]>
2024-04-29platform/x86/intel/tpmi: Align comments in kernel-docSrinivas Pandruvada1-3/+3
Align comments in kernel-doc for the struct intel_tpmi_plat_info. Signed-off-by: Srinivas Pandruvada <[email protected]> Reviewed-by: Ilpo Järvinen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Hans de Goede <[email protected]> Signed-off-by: Hans de Goede <[email protected]>
2024-04-28Merge tag 'x86-urgent-2024-04-28' of ↵Linus Torvalds1-0/+11
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: - Make the CPU_MITIGATIONS=n interaction with conflicting mitigation-enabling boot parameters a bit saner. - Re-enable CPU mitigations by default on non-x86 - Fix TDX shared bit propagation on mprotect() - Fix potential show_regs() system hang when PKE initialization is not fully finished yet. - Add the 0x10-0x1f model IDs to the Zen5 range - Harden #VC instruction emulation some more * tag 'x86-urgent-2024-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: cpu: Ignore "mitigations" kernel parameter if CPU_MITIGATIONS=n cpu: Re-enable CPU mitigations by default for !X86 architectures x86/tdx: Preserve shared bit on mprotect() x86/cpu: Fix check for RDPKRU in __show_regs() x86/CPU/AMD: Add models 0x10-0x1f to the Zen5 range x86/sev: Check for MWAITX and MONITORX opcodes in the #VC handler
2024-04-27profiling: Remove create_prof_cpu_mask().Tetsuo Handa1-5/+0
create_prof_cpu_mask() is no longer used after commit 1f44a225777e ("s390: convert interrupt handling to use generic hardirq"). Signed-off-by: Tetsuo Handa <[email protected]> Signed-off-by: Linus Torvalds <[email protected]>
2024-04-26Merge tag 'for-netdev' of ↵Jakub Kicinski2-0/+3
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-04-26 We've added 12 non-merge commits during the last 22 day(s) which contain a total of 14 files changed, 168 insertions(+), 72 deletions(-). The main changes are: 1) Fix BPF_PROBE_MEM in verifier and JIT to skip loads from vsyscall page, from Puranjay Mohan. 2) Fix a crash in XDP with devmap broadcast redirect when the latter map is in process of being torn down, from Toke Høiland-Jørgensen. 3) Fix arm64 and riscv64 BPF JITs to properly clear start time for BPF program runtime stats, from Xu Kuohai. 4) Fix a sockmap KCSAN-reported data race in sk_psock_skb_ingress_enqueue, from Jason Xing. 5) Fix BPF verifier error message in resolve_pseudo_ldimm64, from Anton Protopopov. 6) Fix missing DEBUG_INFO_BTF_MODULES Kconfig menu item, from Andrii Nakryiko. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Test PROBE_MEM of VSYSCALL_ADDR on x86-64 bpf, x86: Fix PROBE_MEM runtime load check bpf: verifier: prevent userspace memory access xdp: use flags field to disambiguate broadcast redirect arm32, bpf: Reimplement sign-extension mov instruction riscv, bpf: Fix incorrect runtime stats bpf, arm64: Fix incorrect runtime stats bpf: Fix a verifier verbose message bpf, skmsg: Fix NULL pointer dereference in sk_psock_skb_ingress_enqueue MAINTAINERS: bpf: Add Lehui and Puranjay as riscv64 reviewers MAINTAINERS: Update email address for Puranjay Mohan bpf, kconfig: Fix DEBUG_INFO_BTF_MODULES Kconfig definition ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-04-26Merge tag 'soc-fixes-6.9-2' of ↵Linus Torvalds2-9/+56
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc Pull ARM SoC fixes from Arnd Bergmann: "There are a lot of minor DT fixes for Mediatek, Rockchip, Qualcomm and Microchip and NXP, addressing both build-time warnings and bugs found during runtime testing. Most of these changes are machine specific fixups, but there are a few notable regressions that affect an entire SoC: - The Qualcomm MSI support that was improved for 6.9 ended up being wrong on some chips and now gets fixed. - The i.MX8MP camera interface broke due to a typo and gets updated again. The main driver fix is also for Qualcomm platforms, rewriting an interface in the QSEECOM firmware support that could lead to crashing the kernel from a trusted application. The only other code changes are minor fixes for Mediatek SoC drivers" * tag 'soc-fixes-6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (50 commits) ARM: dts: imx6ull-tarragon: fix USB over-current polarity soc: mediatek: mtk-socinfo: depends on CONFIG_SOC_BUS soc: mediatek: mtk-svs: Append "-thermal" to thermal zone names arm64: dts: imx8mp: Fix assigned-clocks for second CSI2 ARM: dts: microchip: at91-sama7g54_curiosity: Replace regulator-suspend-voltage with the valid property ARM: dts: microchip: at91-sama7g5ek: Replace regulator-suspend-voltage with the valid property arm64: dts: rockchip: Fix USB interface compatible string on kobol-helios64 arm64: dts: qcom: sc8180x: Fix ss_phy_irq for secondary USB controller arm64: dts: qcom: sm8650: Fix the msi-map entries arm64: dts: qcom: sm8550: Fix the msi-map entries arm64: dts: qcom: sm8450: Fix the msi-map entries arm64: dts: qcom: sc8280xp: add missing PCIe minimum OPP arm64: dts: qcom: x1e80100: Fix the compatible for cluster idle states arm64: dts: qcom: Fix type of "wdog" IRQs for remoteprocs arm64: dts: rockchip: regulator for sd needs to be always on for BPI-R2Pro dt-bindings: rockchip: grf: Add missing type to 'pcie-phy' node arm64: dts: rockchip: drop redundant disable-gpios in Lubancat 2 arm64: dts: rockchip: drop redundant disable-gpios in Lubancat 1 arm64: dts: rockchip: drop redundant pcie-reset-suspend in Scarlet Dumo arm64: dts: rockchip: mark system power controller and fix typo on orangepi-5-plus ...
2024-04-26Merge tag 'mm-hotfixes-stable-2024-04-26-13-30' of ↵Linus Torvalds2-65/+87
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "11 hotfixes. 8 are cc:stable and the remaining 3 (nice ratio!) address post-6.8 issues or aren't considered suitable for backporting. All except one of these are for MM. I see no particular theme - it's singletons all over" * tag 'mm-hotfixes-stable-2024-04-26-13-30' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/hugetlb: fix DEBUG_LOCKS_WARN_ON(1) when dissolve_free_hugetlb_folio() selftests: mm: protection_keys: save/restore nr_hugepages value from launch script stackdepot: respect __GFP_NOLOCKDEP allocation flag hugetlb: check for anon_vma prior to folio allocation mm: zswap: fix shrinker NULL crash with cgroup_disable=memory mm: turn folio_test_hugetlb into a PageType mm: support page_mapcount() on page_has_type() pages mm: create FOLIO_FLAG_FALSE and FOLIO_TYPE_OPS macros mm/hugetlb: fix missing hugetlb_lock for resv uncharge selftests: mm: fix unused and uninitialized variable warning selftests/harness: remove use of LINE_MAX
2024-04-26pwm: Add a struct device to struct pwm_chipUwe Kleine-König1-2/+3
This replaces the formerly dynamically allocated struct device. This allows to additionally use it to track the lifetime of the struct pwm_chip. Otherwise the new struct device provides the same sysfs API as was provided by the dynamic device before. Link: https://lore.kernel.org/r/35c65ea7f6de789a568ff39d7b6b4ce80de4b7dc.1710670958.git.u.kleine-koenig@pengutronix.de Signed-off-by: Uwe Kleine-König <[email protected]>
2024-04-26pwm: Ensure a struct pwm has the same lifetime as its pwm_chipUwe Kleine-König1-1/+1
It's required to not free the memory underlying a requested PWM while a consumer still has a reference to it. While currently a pwm_chip doesn't live long enough in all cases, linking the struct pwm to the pwm_chip results in the right lifetime as soon as the pwmchip is living long enough. This happens with the following commits. Note this is a breaking change for all pwm drivers that don't use pwmchip_alloc(). Reviewed-by: Gustavo A. R. Silva <[email protected]> # for struct_size() and __counted_by() Link: https://lore.kernel.org/r/7e9e958841f049026c0023b309cc9deecf0ab61d.1710670958.git.u.kleine-koenig@pengutronix.de Signed-off-by: Uwe Kleine-König <[email protected]>
2024-04-26pwm: Move contents of sysfs.c into core.cUwe Kleine-König1-13/+0
With the upcoming restructuring having all in a single file simplifies things a bit. The relevant and somewhat visible changes are: - Some dropped prototypes from include/linux/pwm.h that were only necessary that core.c has a declaration of the symbols defined in sysfs.c. The respective functions are static now. - The pwm class now also exists if CONFIG_SYSFS isn't enabled. Having CONFIG_SYSFS is not very relevant today, but even without it the class and device stuff still provides lifetime tracking. - Both files had an initcall, these are merged into a single one now. Instead of a big #ifdef block for CONFIG_DEBUG_FS, a single if (IS_ENABLED(CONFIG_DEBUG_FS)) is used now. This increases compile coverage a bit and is a tad nicer on the eyes. Link: https://lore.kernel.org/r/9e2d39a5280d7dda5bfc6682a8aef510148635b2.1710670958.git.u.kleine-koenig@pengutronix.de Signed-off-by: Uwe Kleine-König <[email protected]>
2024-04-26pwm: Ensure that pwm_chips are allocated using pwmchip_alloc()Uwe Kleine-König1-0/+2
Memory holding a struct device must not be freed before the reference count drops to zero. So a struct pwm_chip must not live in memory freed by a driver on unbind. All in-tree drivers were fixed accordingly, but as out-of-tree drivers, that were not adapted, still compile fine, catch these in pwmchip_add(). Link: https://lore.kernel.org/r/35f5b229c98f78b2f6ce2397259a4a936be477c0.1707900770.git.u.kleine-koenig@pengutronix.de Signed-off-by: Uwe Kleine-König <[email protected]>
2024-04-26cpufreq: amd-pstate: Add quirk for the pstate CPPC capabilities missingPerry Yuan1-0/+6
Add quirks table to get CPPC capabilities issue fixed by providing correct perf or frequency values while driver loading. If CPPC capabilities are not defined in the ACPI tables or wrongly defined by platform firmware, it needs to use quick to get those issues fixed with correct workaround values to make pstate driver can be loaded even though there are CPPC capabilities errors. The workaround will match the broken BIOS which lack of CPPC capabilities nominal_freq and lowest_freq definition in the ACPI table. $ cat /sys/devices/system/cpu/cpu0/acpi_cppc/lowest_freq 0 $ cat /sys/devices/system/cpu/cpu0/acpi_cppc/nominal_freq 0 Acked-by: Huang Rui <[email protected]> Reviewed-by: Mario Limonciello <[email protected]> Reviewed-by: Gautham R. Shenoy <[email protected]> Tested-by: Dhananjay Ugwekar <[email protected]> Signed-off-by: Perry Yuan <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2024-04-26cpufreq: amd-pstate: Document the units for freq variables in amd_cpudataGautham R. Shenoy1-7/+7
The min_limit_freq, max_limit_freq, min_freq, max_freq, nominal_freq and the lowest_nominal_freq members of struct cpudata store the frequency value in khz to be consistent with the cpufreq core. Update the comment to document this. Reviewed-by: Li Meng <[email protected]> Tested-by: Dhananjay Ugwekar <[email protected]> Signed-off-by: Gautham R. Shenoy <[email protected]> Signed-off-by: Perry Yuan <[email protected]> Acked-by: Huang Rui <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2024-04-26cpufreq: amd-pstate: Document *_limit_* fields in struct amd_cpudataGautham R. Shenoy1-0/+4
The four fields of struct cpudata namely min_limit_perf, max_limit_perf, min_limit_freq, max_limit_freq introduced in the commit febab20caeba("cpufreq/amd-pstate: Fix scaling_min_freq and scaling_max_freq update") are currently undocumented Add comments describing these fields Acked-by: Huang Rui <[email protected]> Fixes: febab20caeba("cpufreq/amd-pstate: Fix scaling_min_freq and scaling_max_freq update") Reviewed-by: Li Meng <[email protected]> Tested-by: Dhananjay Ugwekar <[email protected]> Signed-off-by: Gautham R. Shenoy <[email protected]> Signed-off-by: Perry Yuan <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2024-04-26bpf: verifier: prevent userspace memory accessPuranjay Mohan1-0/+1
With BPF_PROBE_MEM, BPF allows de-referencing an untrusted pointer. To thwart invalid memory accesses, the JITs add an exception table entry for all such accesses. But in case the src_reg + offset is a userspace address, the BPF program might read that memory if the user has mapped it. Make the verifier add guard instructions around such memory accesses and skip the load if the address falls into the userspace region. The JITs need to implement bpf_arch_uaddress_limit() to define where the userspace addresses end for that architecture or TASK_SIZE is taken as default. The implementation is as follows: REG_AX = SRC_REG if(offset) REG_AX += offset; REG_AX >>= 32; if (REG_AX <= (uaddress_limit >> 32)) DST_REG = 0; else DST_REG = *(size *)(SRC_REG + offset); Comparing just the upper 32 bits of the load address with the upper 32 bits of uaddress_limit implies that the values are being aligned down to a 4GB boundary before comparison. The above means that all loads with address <= uaddress_limit + 4GB are skipped. This is acceptable because there is a large hole (much larger than 4GB) between userspace and kernel space memory, therefore a correctly functioning BPF program should not access this 4GB memory above the userspace. Let's analyze what this patch does to the following fentry program dereferencing an untrusted pointer: SEC("fentry/tcp_v4_connect") int BPF_PROG(fentry_tcp_v4_connect, struct sock *sk) { *(volatile long *)sk; return 0; } BPF Program before | BPF Program after ------------------ | ----------------- 0: (79) r1 = *(u64 *)(r1 +0) 0: (79) r1 = *(u64 *)(r1 +0) ----------------------------------------------------------------------- 1: (79) r1 = *(u64 *)(r1 +0) --\ 1: (bf) r11 = r1 ----------------------------\ \ 2: (77) r11 >>= 32 2: (b7) r0 = 0 \ \ 3: (b5) if r11 <= 0x8000 goto pc+2 3: (95) exit \ \-> 4: (79) r1 = *(u64 *)(r1 +0) \ 5: (05) goto pc+1 \ 6: (b7) r1 = 0 \-------------------------------------- 7: (b7) r0 = 0 8: (95) exit As you can see from above, in the best case (off=0), 5 extra instructions are emitted. Now, we analyze the same program after it has gone through the JITs of ARM64 and RISC-V architectures. We follow the single load instruction that has the untrusted pointer and see what instrumentation has been added around it. x86-64 JIT ========== JIT's Instrumentation (upstream) --------------------- 0: nopl 0x0(%rax,%rax,1) 5: xchg %ax,%ax 7: push %rbp 8: mov %rsp,%rbp b: mov 0x0(%rdi),%rdi --------------------------------- f: movabs $0x800000000000,%r11 19: cmp %r11,%rdi 1c: jb 0x000000000000002a 1e: mov %rdi,%r11 21: add $0x0,%r11 28: jae 0x000000000000002e 2a: xor %edi,%edi 2c: jmp 0x0000000000000032 2e: mov 0x0(%rdi),%rdi --------------------------------- 32: xor %eax,%eax 34: leave 35: ret The x86-64 JIT already emits some instructions to protect against user memory access. This patch doesn't make any changes for the x86-64 JIT. ARM64 JIT ========= No Intrumentation Verifier's Instrumentation (upstream) (This patch) ----------------- -------------------------- 0: add x9, x30, #0x0 0: add x9, x30, #0x0 4: nop 4: nop 8: paciasp 8: paciasp c: stp x29, x30, [sp, #-16]! c: stp x29, x30, [sp, #-16]! 10: mov x29, sp 10: mov x29, sp 14: stp x19, x20, [sp, #-16]! 14: stp x19, x20, [sp, #-16]! 18: stp x21, x22, [sp, #-16]! 18: stp x21, x22, [sp, #-16]! 1c: stp x25, x26, [sp, #-16]! 1c: stp x25, x26, [sp, #-16]! 20: stp x27, x28, [sp, #-16]! 20: stp x27, x28, [sp, #-16]! 24: mov x25, sp 24: mov x25, sp 28: mov x26, #0x0 28: mov x26, #0x0 2c: sub x27, x25, #0x0 2c: sub x27, x25, #0x0 30: sub sp, sp, #0x0 30: sub sp, sp, #0x0 34: ldr x0, [x0] 34: ldr x0, [x0] -------------------------------------------------------------------------------- 38: ldr x0, [x0] ----------\ 38: add x9, x0, #0x0 -----------------------------------\\ 3c: lsr x9, x9, #32 3c: mov x7, #0x0 \\ 40: cmp x9, #0x10, lsl #12 40: mov sp, sp \\ 44: b.ls 0x0000000000000050 44: ldp x27, x28, [sp], #16 \\--> 48: ldr x0, [x0] 48: ldp x25, x26, [sp], #16 \ 4c: b 0x0000000000000054 4c: ldp x21, x22, [sp], #16 \ 50: mov x0, #0x0 50: ldp x19, x20, [sp], #16 \--------------------------------------- 54: ldp x29, x30, [sp], #16 54: mov x7, #0x0 58: add x0, x7, #0x0 58: mov sp, sp 5c: autiasp 5c: ldp x27, x28, [sp], #16 60: ret 60: ldp x25, x26, [sp], #16 64: nop 64: ldp x21, x22, [sp], #16 68: ldr x10, 0x0000000000000070 68: ldp x19, x20, [sp], #16 6c: br x10 6c: ldp x29, x30, [sp], #16 70: add x0, x7, #0x0 74: autiasp 78: ret 7c: nop 80: ldr x10, 0x0000000000000088 84: br x10 There are 6 extra instructions added in ARM64 in the best case. This will become 7 in the worst case (off != 0). RISC-V JIT (RISCV_ISA_C Disabled) ========== No Intrumentation Verifier's Instrumentation (upstream) (This patch) ----------------- -------------------------- 0: nop 0: nop 4: nop 4: nop 8: li a6, 33 8: li a6, 33 c: addi sp, sp, -16 c: addi sp, sp, -16 10: sd s0, 8(sp) 10: sd s0, 8(sp) 14: addi s0, sp, 16 14: addi s0, sp, 16 18: ld a0, 0(a0) 18: ld a0, 0(a0) --------------------------------------------------------------- 1c: ld a0, 0(a0) --\ 1c: mv t0, a0 --------------------------\ \ 20: srli t0, t0, 32 20: li a5, 0 \ \ 24: lui t1, 4096 24: ld s0, 8(sp) \ \ 28: sext.w t1, t1 28: addi sp, sp, 16 \ \ 2c: bgeu t1, t0, 12 2c: sext.w a0, a5 \ \--> 30: ld a0, 0(a0) 30: ret \ 34: j 8 \ 38: li a0, 0 \------------------------------ 3c: li a5, 0 40: ld s0, 8(sp) 44: addi sp, sp, 16 48: sext.w a0, a5 4c: ret There are 7 extra instructions added in RISC-V. Fixes: 800834285361 ("bpf, arm64: Add BPF exception tables") Reported-by: Breno Leitao <[email protected]> Suggested-by: Alexei Starovoitov <[email protected]> Acked-by: Ilya Leoshkevich <[email protected]> Signed-off-by: Puranjay Mohan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2024-04-26Merge tag 'qcom-drivers-fixes-for-6.9' of ↵Arnd Bergmann2-9/+56
https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into for-next Qualcomm driver fix for v6.9 This reworks the memory layout of the argument buffers passed to trusted applications in QSEECOM, to avoid failures and system crashes. * tag 'qcom-drivers-fixes-for-6.9' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux: firmware: qcom: uefisecapp: Fix memory related IO errors and crashes Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Arnd Bergmann <[email protected]>
2024-04-26virtio: add debugfs infrastructure to allow to debug virtio featuresJiri Pirko1-0/+35
Currently there is no way for user to set what features the driver should obey or not, it is hard wired in the code. In order to be able to debug the device behavior in case some feature is disabled, introduce a debugfs infrastructure with couple of files allowing user to see what features the device advertises and to set filter for features used by driver. Example: $cat /sys/bus/virtio/devices/virtio0/features 1110010111111111111101010000110010000000100000000000000000000000 $ echo "5" >/sys/kernel/debug/virtio/virtio0/filter_feature_add $ cat /sys/kernel/debug/virtio/virtio0/filter_features 5 $ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/unbind $ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/bind $ cat /sys/bus/virtio/devices/virtio0/features 1110000111111111111101010000110010000000100000000000000000000000 Note that sysfs "features" now already exists, this patch does not touch it. Signed-off-by: Jiri Pirko <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Signed-off-by: Paolo Abeni <[email protected]>
2024-04-26Merge branch 'memory-observability' into x86/amdJoerg Roedel1-1/+4
2024-04-26iommu: Add ops->domain_alloc_sva()Jason Gunthorpe1-0/+3
Make a new op that receives the device and the mm_struct that the SVA domain should be created for. Unlike domain_alloc_paging() the dev argument is never NULL here. This allows drivers to fully initialize the SVA domain and allocate the mmu_notifier during allocation. It allows the notifier lifetime to follow the lifetime of the iommu_domain. Since we have only one call site, upgrade the new op to return ERR_PTR instead of NULL. Signed-off-by: Jason Gunthorpe <[email protected]> [Removed smmu3 related changes - Vasant] Signed-off-by: Vasant Hegde <[email protected]> Reviewed-by: Tina Zhang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2024-04-26dma-mapping: Simplify arch_setup_dma_ops()Robin Murphy1-4/+2
The dma_base, size and iommu arguments are only used by ARM, and can now easily be deduced from the device itself, so there's no need to pass them through the callchain as well. Acked-by: Rob Herring <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Michael Kelley <[email protected]> # For Hyper-V Reviewed-by: Jason Gunthorpe <[email protected]> Tested-by: Hanjun Guo <[email protected]> Signed-off-by: Robin Murphy <[email protected]> Acked-by: Catalin Marinas <[email protected]> Link: https://lore.kernel.org/r/5291c2326eab405b1aa7693aa964e8d3cb7193de.1713523152.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <[email protected]>
2024-04-26iommu/dma: Centralise iommu_setup_dma_ops()Robin Murphy1-7/+0
It's somewhat hard to see, but arm64's arch_setup_dma_ops() should only ever call iommu_setup_dma_ops() after a successful iommu_probe_device(), which means there should be no harm in achieving the same order of operations by running it off the back of iommu_probe_device() itself. This then puts it in line with the x86 and s390 .probe_finalize bodges, letting us pull it all into the main flow properly. As a bonus this lets us fold in and de-scope the PCI workaround setup as well. At this point we can also then pull the call up inside the group mutex, and avoid having to think about whether iommu_group_store_type() could theoretically race and free the domain if iommu_setup_dma_ops() ran just *before* iommu_device_use_default_domain() claims it... Furthermore we replace one .probe_finalize call completely, since the only remaining implementations are now one which only needs to run once for the initial boot-time probe, and two which themselves render that path unreachable. This leaves us a big step closer to realistically being able to unpick the variety of different things that iommu_setup_dma_ops() has been muddling together, and further streamline iommu-dma into core API flows in future. Reviewed-by: Lu Baolu <[email protected]> # For Intel IOMMU Reviewed-by: Jason Gunthorpe <[email protected]> Tested-by: Hanjun Guo <[email protected]> Signed-off-by: Robin Murphy <[email protected]> Acked-by: Catalin Marinas <[email protected]> Link: https://lore.kernel.org/r/bebea331c1d688b34d9862eefd5ede47503961b8.1713523152.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <[email protected]>
2024-04-26dma-mapping: Add helpers for dma_range_map boundsRobin Murphy1-0/+18
Several places want to compute the lower and/or upper bounds of a dma_range_map, so let's factor that out into reusable helpers. Acked-by: Rob Herring <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Hanjun Guo <[email protected]> # For arm64 Reviewed-by: Jason Gunthorpe <[email protected]> Tested-by: Hanjun Guo <[email protected]> Signed-off-by: Robin Murphy <[email protected]> Link: https://lore.kernel.org/r/45ec52f033ec4dfb364e23f48abaf787f612fa53.1713523152.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <[email protected]>
2024-04-26ACPI/IORT: Handle memory address size limits as limitsRobin Murphy1-2/+2
Return the Root Complex/Named Component memory address size limit as an inclusive limit value, rather than an exclusive size. This saves having to fudge an off-by-one for the 64-bit case, and simplifies our caller. Acked-by: Hanjun Guo <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Tested-by: Hanjun Guo <[email protected]> Signed-off-by: Robin Murphy <[email protected]> Link: https://lore.kernel.org/r/284ae9fbadb12f2e3b5a30cd4d037d0e6843a8f4.1713523152.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <[email protected]>
2024-04-26iommu: Add ops->domain_alloc_sva()Jason Gunthorpe1-0/+3
Make a new op that receives the device and the mm_struct that the SVA domain should be created for. Unlike domain_alloc_paging() the dev argument is never NULL here. This allows drivers to fully initialize the SVA domain and allocate the mmu_notifier during allocation. It allows the notifier lifetime to follow the lifetime of the iommu_domain. Since we have only one call site, upgrade the new op to return ERR_PTR instead of NULL. Signed-off-by: Jason Gunthorpe <[email protected]> Signed-off-by: Vasant Hegde <[email protected]> Reviewed-by: Tina Zhang <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]>
2024-04-26iommu/vt-d: Remove private data use in fault messageJingqi Liu1-2/+1
According to Intel VT-d specification revision 4.0, "Private Data" field has been removed from Page Request/Response. Since the private data field is not used in fault message, remove the related definitions in page request descriptor and remove the related code in page request/response handler, as Intel hasn't shipped any products which support private data in the page request message. Signed-off-by: Jingqi Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2024-04-26iommu/vt-d: Allocate DMAR fault interrupts locallyDimitri Sivanich1-1/+1
The Intel IOMMU code currently tries to allocate all DMAR fault interrupt vectors on the boot cpu. On large systems with high DMAR counts this results in vector exhaustion, and most of the vectors are not initially allocated socket local. Instead, have a cpu on each node do the vector allocation for the DMARs on that node. The boot cpu still does the allocation for its node during its boot sequence. Signed-off-by: Dimitri Sivanich <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Lu Baolu <[email protected]> Signed-off-by: Joerg Roedel <[email protected]>
2024-04-26fs: Create anon_inode_getfile_fmode()Dawid Osuchowski1-0/+5
Creates an anon_inode_getfile_fmode() function that works similarly to anon_inode_getfile() with the addition of being able to set the fmode member. Signed-off-by: Dawid Osuchowski <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Christian Brauner <[email protected]>
2024-04-26drivers/perf: riscv: Implement SBI PMU snapshot functionAtish Patra1-0/+8
SBI v2.0 SBI introduced PMU snapshot feature which adds the following features. 1. Read counter values directly from the shared memory instead of csr read. 2. Start multiple counters with initial values with one SBI call. These functionalities optimizes the number of traps to the higher privilege mode. If the kernel is in VS mode while the hypervisor deploy trap & emulate method, this would minimize all the hpmcounter CSR read traps. If the kernel is running in S-mode, the benefits reduced to CSR latency vs DRAM/cache latency as there is no trap involved while accessing the hpmcounter CSRs. In both modes, it does saves the number of ecalls while starting multiple counter together with an initial values. This is a likely scenario if multiple counters overflow at the same time. Acked-by: Palmer Dabbelt <[email protected]> Reviewed-by: Anup Patel <[email protected]> Reviewed-by: Conor Dooley <[email protected]> Reviewed-by: Andrew Jones <[email protected]> Reviewed-by: Samuel Holland <[email protected]> Signed-off-by: Atish Patra <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Anup Patel <[email protected]>
2024-04-26mmc: slot-gpio: Use irq_handler_t typeAndy Shevchenko1-3/+2
The irq_handler_t is already defined globally, let's use it in slot-gpio code. Signed-off-by: Andy Shevchenko <[email protected]> Reviewed-by: Alexander Stein <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Ulf Hansson <[email protected]>
2024-04-26mmc: core: Add mmc_gpiod_set_cd_config() functionHans de Goede1-0/+1
Some mmc host drivers may need to fixup a card-detection GPIO's config to e.g. enable the GPIO controllers builtin pull-up resistor on devices where the firmware description of the GPIO is broken (e.g. GpioInt with PullNone instead of PullUp in ACPI DSDT). Since this is the exception rather then the rule adding a config parameter to mmc_gpiod_request_cd() seems undesirable, so instead add a new mmc_gpiod_set_cd_config() function. This is simply a wrapper to call gpiod_set_config() on the card-detect GPIO acquired through mmc_gpiod_request_cd(). Reviewed-by: Andy Shevchenko <[email protected]> Signed-off-by: Hans de Goede <[email protected]> Acked-by: Adrian Hunter <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Ulf Hansson <[email protected]>
2024-04-26xfrm: Preserve vlan tags for transport mode software GROPaul Davey1-0/+15
The software GRO path for esp transport mode uses skb_mac_header_rebuild prior to re-injecting the packet via the xfrm_napi_dev. This only copies skb->mac_len bytes of header which may not be sufficient if the packet contains 802.1Q tags or other VLAN tags. Worse copying only the initial header will leave a packet marked as being VLAN tagged but without the corresponding tag leading to mangling when it is later untagged. The VLAN tags are important when receiving the decrypted esp transport mode packet after GRO processing to ensure it is received on the correct interface. Therefore record the full mac header length in xfrm*_transport_input for later use in corresponding xfrm*_transport_finish to copy the entire mac header when rebuilding the mac header for GRO. The skb->data pointer is left pointing skb->mac_header bytes after the start of the mac header as is expected by the network stack and network and transport header offsets reset to this location. Fixes: 7785bba299a8 ("esp: Add a software GRO codepath") Signed-off-by: Paul Davey <[email protected]> Signed-off-by: Steffen Klassert <[email protected]>
2024-04-25instrumented.h: add instrument_memcpy_before, instrument_memcpy_afterAlexander Potapenko1-0/+35
Bug detection tools based on compiler instrumentation may miss memory accesses in custom memcpy implementations (such as copy_mc_to_kernel). Provide instrumentation hooks that tell KASAN, KCSAN, and KMSAN about such accesses. Link: https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alexander Potapenko <[email protected]> Reviewed-by: Marco Elver <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25mm: kmsan: implement kmsan_memmove()Alexander Potapenko1-0/+15
Provide a hook that can be used by custom memcpy implementations to tell KMSAN that the metadata needs to be copied. Without that, false positive reports are possible in the cases where KMSAN fails to intercept memory initialization. Link: https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Alexander Potapenko <[email protected]> Suggested-by: Tetsuo Handa <[email protected]> Reviewed-by: Marco Elver <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Tetsuo Handa <[email protected]> Cc: Thomas Gleixner <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25mm: set pageblock_order to HPAGE_PMD_ORDER in case with !CONFIG_HUGETLB_PAGE ↵Baolin Wang1-2/+6
but THP enabled As Vlastimil suggested in previous discussion[1], it doesn't make sense to set pageblock_order as MAX_PAGE_ORDER when hugetlbfs is not enabled and THP is enabled. Instead, it should be set to HPAGE_PMD_ORDER. [1] https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/3d57d253070035bdc0f6d6e5681ce1ed0e1934f7.1712286863.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <[email protected]> Suggested-by: Vlastimil Babka <[email protected]> Acked-by: Vlastimil Babka <[email protected]> Reviewed-by: Zi Yan <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Mel Gorman <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25mm: inline destroy_large_folio() into __folio_put_large()Matthew Wilcox (Oracle)1-2/+0
destroy_large_folio() has only one caller, move its contents there. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: Zi Yan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25mm/ksm: remove redundant code in ksm_forkJinjiang Tu1-10/+2
Since commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl"), when a child process is forked, the MMF_VM_MERGE_ANY flag will be inherited in mm_init(). So, it's unnecessary to set the flag in ksm_fork(). Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Jinjiang Tu <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Nanyong Sun <[email protected]> Cc: Rik van Riel <[email protected]> Cc: Stefan Roesch <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25mm/treewide: rename CONFIG_HAVE_FAST_GUP to CONFIG_HAVE_GUP_FASTDavid Hildenbrand1-4/+4
Nowadays, we call it "GUP-fast", the external interface includes functions like "get_user_pages_fast()", and we renamed all internal functions to reflect that as well. Let's make the config option reflect that. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: David Hildenbrand <[email protected]> Reviewed-by: Mike Rapoport (IBM) <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: John Hubbard <[email protected]> Cc: Peter Xu <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25mm: madvise: avoid split during MADV_PAGEOUT and MADV_COLDRyan Roberts1-0/+30
Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large folio that is fully and contiguously mapped in the pageout/cold vm range. This change means that large folios will be maintained all the way to swap storage. This both improves performance during swap-out, by eliding the cost of splitting the folio, and sets us up nicely for maintaining the large folio when it is swapped back in (to be covered in a separate series). Folios that are not fully mapped in the target range are still split, but note that behavior is changed so that if the split fails for any reason (folio locked, shared, etc) we now leave it as is and move to the next pte in the range and continue work on the proceeding folios. Previously any failure of this sort would cause the entire operation to give up and no folios mapped at higher addresses were paged out or made cold. Given large folios are becoming more common, this old behavior would have likely lead to wasted opportunities. While we are at it, change the code that clears young from the ptes to use ptep_test_and_clear_young(), via the new mkold_ptes() batch helper function. This is more efficent than get_and_clear/modify/set, especially for contpte mappings on arm64, where the old approach would require unfolding/refolding and the new approach can be done in place. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryan Roberts <[email protected]> Reviewed-by: Barry Song <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Barry Song <[email protected]> Cc: Chris Li <[email protected]> Cc: Gao Xiang <[email protected]> Cc: "Huang, Ying" <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Lance Yang <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25mm: swap: allow storage of all mTHP ordersRyan Roberts1-1/+7
Multi-size THP enables performance improvements by allocating large, pte-mapped folios for anonymous memory. However I've observed that on an arm64 system running a parallel workload (e.g. kernel compilation) across many cores, under high memory pressure, the speed regresses. This is due to bottlenecking on the increased number of TLBIs added due to all the extra folio splitting when the large folios are swapped out. Therefore, solve this regression by adding support for swapping out mTHP without needing to split the folio, just like is already done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled, and when the swap backing store is a non-rotating block device. These are the same constraints as for the existing PMD-sized THP swap-out support. Note that no attempt is made to swap-in (m)THP here - this is still done page-by-page, like for PMD-sized THP. But swapping-out mTHP is a prerequisite for swapping-in mTHP. The main change here is to improve the swap entry allocator so that it can allocate any power-of-2 number of contiguous entries between [1, (1 << PMD_ORDER)]. This is done by allocating a cluster for each distinct order and allocating sequentially from it until the cluster is full. This ensures that we don't need to search the map and we get no fragmentation due to alignment padding for different orders in the cluster. If there is no current cluster for a given order, we attempt to allocate a free cluster from the list. If there are no free clusters, we fail the allocation and the caller can fall back to splitting the folio and allocates individual entries (as per existing PMD-sized THP fallback). The per-order current clusters are maintained per-cpu using the existing infrastructure. This is done to avoid interleving pages from different tasks, which would prevent IO being batched. This is already done for the order-0 allocations so we follow the same pattern. As is done for order-0 per-cpu clusters, the scanner now can steal order-0 entries from any per-cpu-per-order reserved cluster. This ensures that when the swap file is getting full, space doesn't get tied up in the per-cpu reserves. This change only modifies swap to be able to accept any order mTHP. It doesn't change the callers to elide doing the actual split. That will be done in separate changes. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryan Roberts <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Cc: Barry Song <[email protected]> Cc: Barry Song <[email protected]> Cc: Chris Li <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Gao Xiang <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Lance Yang <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-04-25mm: swap: update get_swap_pages() to take folio orderRyan Roberts1-1/+1
We are about to allow swap storage of any mTHP size. To prepare for that, let's change get_swap_pages() to take a folio order parameter instead of nr_pages. This makes the interface self-documenting; a power-of-2 number of pages must be provided. We will also need the order internally so this simplifies accessing it. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryan Roberts <[email protected]> Reviewed-by: "Huang, Ying" <[email protected]> Reviewed-by: David Hildenbrand <[email protected]> Cc: Barry Song <[email protected]> Cc: Barry Song <[email protected]> Cc: Chris Li <[email protected]> Cc: Gao Xiang <[email protected]> Cc: Kefeng Wang <[email protected]> Cc: Lance Yang <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Yang Shi <[email protected]> Cc: Yu Zhao <[email protected]> Signed-off-by: Andrew Morton <[email protected]>