path: root/include/linux
Age | Commit message | Author | Files | Lines
2024-08-20coresight: Remove pending trace ID release mechanismJames Clark1-4/+2
Pending the release of IDs was a way of managing concurrent sysfs and Perf sessions in a single global ID map. Perf may have finished while sysfs hadn't, and Perf shouldn't release the IDs in use by sysfs and vice versa. Now that Perf uses its own exclusive ID maps, pending release doesn't result in any different behavior than just releasing all IDs when the last Perf session finishes. As part of the per-sink trace ID change, we would have still had to make the pending mechanism work on a per-sink basis, due to the overlapping ID allocations, so instead of making that more complicated, just remove it. Signed-off-by: James Clark <[email protected]> Reviewed-by: Mike Leach <[email protected]> Signed-off-by: James Clark <[email protected]> Signed-off-by: Suzuki K Poulose <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-08-20coresight: Use per-sink trace ID maps for Perf sessionsJames Clark1-1/+2
This will allow sessions with more than CORESIGHT_TRACE_IDS_MAX ETMs as long as there are fewer than that many ETMs connected to each sink. Each sink owns its own trace ID map, and any Perf session connecting to that sink will allocate from it, even if the sink is currently in use by other users. This is similar to the existing behavior where the dynamic trace IDs are constant as long as there is any concurrent Perf session active. It's not completely optimal because slightly more IDs will be used than necessary, but the optimal solution involves tracking the PIDs of each session and allocating ID maps based on the session owner. This is difficult to do with the combination of per-thread and per-cpu modes and some scheduling issues. The complexity of this isn't likely to be worth it because even with multiple users they'd just see a difference in the ordering of ID allocations rather than hitting any limits (unless the hardware does have too many ETMs connected to one sink). Signed-off-by: James Clark <[email protected]> Reviewed-by: Mike Leach <[email protected]> Signed-off-by: James Clark <[email protected]> Signed-off-by: Suzuki K Poulose <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-08-20coresight: Make CPU id map a property of a trace ID mapJames Clark1-0/+1
The global CPU ID mappings won't work for per-sink ID maps so move it to the ID map struct. coresight_trace_id_release_all_pending() is hard coded to operate on the default map, but once Perf sessions use their own maps the pending release mechanism will be deleted. So it doesn't need to be extended to accept a trace ID map argument at this point. Signed-off-by: James Clark <[email protected]> Reviewed-by: Mike Leach <[email protected]> Tested-by: Leo Yan <[email protected]> Signed-off-by: James Clark <[email protected]> Signed-off-by: Suzuki K Poulose <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2024-08-20coresight: Move struct coresight_trace_id_map to common headerJames Clark1-0/+18
The trace ID maps will need to be created and stored by the core and Perf code so move the definition up to the common header. Reviewed-by: Anshuman Khandual <[email protected]> Reviewed-by: Mike Leach <[email protected]> Signed-off-by: James Clark <[email protected]> Tested-by: Leo Yan <[email protected]> Tested-by: Ganapatrao Kulkarni <[email protected]> Signed-off-by: James Clark <[email protected]> Signed-off-by: Suzuki K Poulose <[email protected]> Link: https://lore.kernel.org/r/[email protected]
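For reference, a sketch of the struct being moved, as it stands at this point in the series (the follow-up patches above add the per-CPU map and later delete the pending-release bitmap; field spelling is quoted from memory, not the patch):

	struct coresight_trace_id_map {
		DECLARE_BITMAP(used_ids, CORESIGHT_TRACE_IDS_MAX);
		DECLARE_BITMAP(pend_rel_ids, CORESIGHT_TRACE_IDS_MAX);
	};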
2024-08-20x86/kaslr: Expose and use the end of the physical memory address spaceThomas Gleixner1-0/+4
iounmap() on x86 occasionally fails to unmap because the provided valid ioremap address is not below high_memory. It turned out that this happens due to KASLR. KASLR uses the full address space between PAGE_OFFSET and vaddr_end to randomize the starting points of the direct map, vmalloc and vmemmap regions. It thereby limits the size of the direct map by using the installed memory size plus an extra configurable margin for hot-plug memory. This limitation is done to gain more randomization space because otherwise only the holes between the direct map, vmalloc, vmemmap and vaddr_end would be usable for randomizing. The limited direct map size is not exposed to the rest of the kernel, so the memory hot-plug and resource management related code paths still operate under the assumption that the available address space can be determined with MAX_PHYSMEM_BITS. request_free_mem_region() allocates from (1 << MAX_PHYSMEM_BITS) - 1 downwards. That means the first allocation happens past the end of the direct map and if unlucky this address is in the vmalloc space, which causes high_memory to become greater than VMALLOC_START and consequently causes iounmap() to fail for valid ioremap addresses. MAX_PHYSMEM_BITS cannot be changed for that because the randomization does not align with address bit boundaries and there are other places which actually need to know the maximum number of address bits. All remaining usage sites of MAX_PHYSMEM_BITS have been analyzed and found to be correct. Cure this by exposing the end of the direct map via PHYSMEM_END and use that for the memory hot-plug and resource management related places instead of relying on MAX_PHYSMEM_BITS. In the KASLR case PHYSMEM_END maps to a variable which is initialized by the KASLR initialization and otherwise it is based on MAX_PHYSMEM_BITS as before. To prevent future hiccups add a check into add_pages() to catch callers trying to add memory above PHYSMEM_END. Fixes: 0483e1fa6e09 ("x86/mm: Implement ASLR for kernel memory regions") Reported-by: Max Ramanouski <[email protected]> Reported-by: Alistair Popple <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Max Ramanouski <[email protected]> Tested-by: Alistair Popple <[email protected]> Reviewed-by: Dan Williams <[email protected]> Reviewed-by: Alistair Popple <[email protected]> Reviewed-by: Kees Cook <[email protected]> Cc: [email protected] Link: https://lore.kernel.org/all/87ed6soy3z.ffs@tglx
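The non-KASLR fallback this describes amounts to a one-liner; a sketch, assuming the generic definition sits next to the existing MAX_PHYSMEM_BITS users:

	/* Architectures with a KASLR-limited direct map override this. */
	#ifndef PHYSMEM_END
	# define PHYSMEM_END	((1ULL << MAX_PHYSMEM_BITS) - 1)
	#endif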
2024-08-20PM: domains: add device managed version of dev_pm_domain_attach|detach_list()Dikshita Agarwal1-0/+11
Add the devres-enabled version of dev_pm_domain_attach|detach_list. If client drivers use devm_pm_domain_attach_list() to attach the PM domains, devm_pm_domain_detach_list() will be invoked implicitly during remove phase. Signed-off-by: Dikshita Agarwal <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Ulf Hansson <[email protected]>
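A minimal probe-time sketch of the new devres variant; the domain names are hypothetical, and the dev_pm_domain_attach_data layout is assumed from the existing non-devres API:

	static int foo_probe(struct platform_device *pdev)
	{
		static const char * const pd_names[] = { "mx", "cx" }; /* hypothetical */
		struct dev_pm_domain_attach_data pd_data = {
			.pd_names = pd_names,
			.num_pd_names = ARRAY_SIZE(pd_names),
		};
		struct dev_pm_domain_list *pd_list;
		int ret;

		ret = devm_pm_domain_attach_list(&pdev->dev, &pd_data, &pd_list);
		if (ret < 0)
			return ret;

		/* No explicit detach: devres runs devm_pm_domain_detach_list()
		 * automatically during the remove phase. */
		return 0;
	}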
2024-08-20net: add copy from skb_seq_state to buffer functionChristian Hopps1-0/+1
Add an skb helper function to copy a range of bytes from within an existing skb_seq_state. Signed-off-by: Christian Hopps <[email protected]> Signed-off-by: Steffen Klassert <[email protected]>
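A usage sketch, assuming the helper added by the series is skb_copy_seq_read() and slots into the usual prepare/abort sequence:

	struct skb_seq_state st;
	u8 hdr[16];

	skb_prepare_seq_read(skb, 0, skb->len, &st);
	/* copy sizeof(hdr) bytes starting at 'offset' within the seq state */
	if (skb_copy_seq_read(&st, offset, hdr, sizeof(hdr)))
		goto drop;	/* requested range not available */
	skb_abort_seq_read(&st);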
2024-08-19bpf: Allow bpf_current_task_under_cgroup() with BPF_CGROUP_*Matteo Croce1-0/+1
The helper bpf_current_task_under_cgroup() is currently only allowed for tracing programs; allow its usage also in the BPF_CGROUP_* program types. Move the code from kernel/trace/bpf_trace.c to kernel/bpf/helpers.c, so it also compiles without CONFIG_BPF_EVENTS. This will be used in systemd-networkd to monitor sysctl writes, and filter its own writes from others: https://github.com/systemd/systemd/pull/32212 Signed-off-by: Matteo Croce <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
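A hedged sketch of the newly-permitted use from a cgroup-attached program; the map name, section, and allow-all policy here are illustrative, not from the patch:

	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	struct {
		__uint(type, BPF_MAP_TYPE_CGROUP_ARRAY);
		__uint(max_entries, 1);
		__uint(key_size, sizeof(__u32));
		__uint(value_size, sizeof(__u32));
	} own_cgroup SEC(".maps");	/* slot 0: the daemon's own cgroup fd */

	SEC("cgroup/sysctl")
	int monitor_sysctl(struct bpf_sysctl *ctx)
	{
		/* previously tracing-only: skip the daemon's own writes */
		if (bpf_current_task_under_cgroup(&own_cgroup, 0))
			return 1;	/* allow, don't log */

		/* ... record/inspect foreign sysctl writes here ... */
		return 1;
	}

	char _license[] SEC("license") = "GPL";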
2024-08-19workqueue: Fix htmldocs build warningTejun Heo1-1/+1
Fix htmldocs build warning introduced by ec0a7d44b358 ("workqueue: Add interface for user-defined workqueue lockdep map"). Signed-off-by: Tejun Heo <[email protected]> Reported-by: Stephen Rothwell <[email protected]> Cc: Matthew Brost <[email protected]>
2024-08-19ALSA/ASoC/SoundWire: Intel: update maximum numberMark Brown1-0/+8
Merge series from Bard Liao <[email protected]>: Intel new platforms can have up to 5 SoundWire links. This series does not apply to SoundWire tree due to recent changes in machine driver. Can we go via ASoC tree with Vinod's Acked-by tag?
2024-08-19x86: support user address masking instead of non-speculative conditionalLinus Torvalds1-0/+7
The Spectre-v1 mitigations made "access_ok()" much more expensive, since it has to serialize execution with the test for a valid user address. All the normal user copy routines avoid this by just masking the user address with a data-dependent mask instead, but the fast "unsafe_user_read()" kind of patterns that were supposed to be a fast case got slowed down. This introduces a notion of using src = masked_user_access_begin(src); to do the user address sanity check using a data-dependent mask instead of the more traditional conditional if (user_read_access_begin(src, len)) { model. This model only works for dense accesses that start at 'src' and on architectures that have a guard region that is guaranteed to fault in between the user space and the kernel space area. With this, the user access doesn't need to be manually checked, because a bad address is guaranteed to fault (by some architecture masking trick: on x86-64 this involves just turning an invalid user address into all ones, since we don't map the top of address space). This only converts a couple of examples for now. Example x86-64 code generation for loading two words from user space:

	stac
	mov %rax,%rcx
	sar $0x3f,%rcx
	or %rax,%rcx
	mov (%rcx),%r13
	mov 0x8(%rcx),%r14
	clac

where all the error handling and -EFAULT is now purely handled out of line by the exception path. Of course, if the micro-architecture does badly at 'clac' and 'stac', the above is still pitifully slow. But at least we did as well as we could. Signed-off-by: Linus Torvalds <[email protected]>
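A C-level sketch of the caller pattern this enables (helper names per the commit; the function and error label are illustrative):

	long get_user_word(const unsigned long __user *src, unsigned long *dst)
	{
		unsigned long val;

		if (can_do_masked_user_access())
			src = masked_user_access_begin(src);
		else if (!user_read_access_begin(src, sizeof(*src)))
			return -EFAULT;

		unsafe_get_user(val, src, Efault);
		user_read_access_end();
		*dst = val;
		return 0;

	Efault:
		user_read_access_end();
		return -EFAULT;
	}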
2024-08-19Merge tag 'printk-for-6.11-rc5' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux Pull printk fix from Petr Mladek: - Do not block printk on non-panic CPUs when they are dumping backtraces * tag 'printk-for-6.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux: printk/panic: Allow cpu backtraces to be written into ringbuffer during panic
2024-08-19block: Drop NULL check in bdev_write_zeroes_sectors()John Garry1-6/+1
Function bdev_get_queue() must not return NULL, so drop the check in bdev_write_zeroes_sectors(). Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: John Garry <[email protected]> Reviewed-by: Martin K. Petersen <[email protected]> Reviewed-by: Nitesh Shetty <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jens Axboe <[email protected]>
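What remains after the drop is essentially a one-line accessor; a sketch:

	static inline unsigned int bdev_write_zeroes_sectors(struct block_device *bdev)
	{
		/* bdev_get_queue() never returns NULL, so no check is needed */
		return bdev_get_queue(bdev)->limits.max_write_zeroes_sectors;
	}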
2024-08-19Merge back thermal core material for 6.12.Rafael J. Wysocki1-3/+0
2024-08-19percpu-rwsem: remove the unused parameter 'read'Wang Long2-2/+2
In the function percpu_rwsem_release, the parameter `read` is unused, so remove it. Signed-off-by: Wang Long <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
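The shape of the change, sketched (the lockdep body is an assumption):

	/* before */
	static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
						bool read, unsigned long ip);

	/* after: the unused 'read' parameter is gone */
	static inline void percpu_rwsem_release(struct percpu_rw_semaphore *sem,
						unsigned long ip)
	{
		lock_release(&sem->dep_map, ip);
	}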
2024-08-19fs: don't flush in-flight wb switches for superblocks without cgroup writebackHaifeng Xu1-2/+2
When deactivating any type of superblock, it has to wait for the in-flight wb switches to be completed. wb switches are executed in inode_switch_wbs_work_fn(), which needs to acquire the wb_switch_rwsem and races against sync_inodes_sb(). If there is too much dirty data in the superblock, the waiting time may increase significantly. Superblocks without cgroup writeback, such as tmpfs, have nothing to do with the wb switches, so the flushing can be avoided for them. Signed-off-by: Haifeng Xu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Jan Kara <[email protected]> Suggested-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2024-08-19soundwire: intel: increase maximum number of linksPierre-Louis Bossart1-1/+1
Intel platforms have enabled 4 links since the beginning, newer platforms now have 5 links. Update the definition accordingly. This patch will have no effect on older platforms where the number of links was hard-coded. A follow-up patch will add a dynamic check that the ACPI-reported information is aligned with hardware capabilities on newer platforms. Acked-by: Vinod Koul <[email protected]> Signed-off-by: Pierre-Louis Bossart <[email protected]> Reviewed-by: Péter Ujfalusi <[email protected]> Signed-off-by: Bard Liao <[email protected]> Acked-by: Mark Brown <[email protected]> Reviewed-by: Takashi Iwai <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Mark Brown <[email protected]>
2024-08-19soundwire: intel: add probe-time check on link idPierre-Louis Bossart1-0/+3
In older platforms, the number of links was constant and hard-coded to 4. Newer platforms can have varying number of links, so we need to add a probe-time check to make sure the ACPI-reported information with _DSD properties is aligned with hardware capabilities reported in the SoundWire LCAP register. Acked-by: Vinod Koul <[email protected]> Signed-off-by: Pierre-Louis Bossart <[email protected]> Reviewed-by: Péter Ujfalusi <[email protected]> Signed-off-by: Bard Liao <[email protected]> Acked-by: Mark Brown <[email protected]> Reviewed-by: Takashi Iwai <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Mark Brown <[email protected]>
2024-08-19ALSA/ASoC/SoundWire: Intel: use single definition for SDW_INTEL_MAX_LINKSPierre-Louis Bossart1-0/+5
The definitions are currently duplicated in intel-sdw-acpi.c and sof_sdw.c. Move the definition to the sdw_intel.h header, and change the prefix to make it Intel-specific. No functionality change in this patch. Signed-off-by: Pierre-Louis Bossart <[email protected]> Reviewed-by: Péter Ujfalusi <[email protected]> Signed-off-by: Bard Liao <[email protected]> Reviewed-by: Takashi Iwai <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Mark Brown <[email protected]>
2024-08-19Merge 6.11-rc4 into tty-nextGreg Kroah-Hartman16-31/+111
We need the tty/serial fixes in here as well. Signed-off-by: Greg Kroah-Hartman <[email protected]>
2024-08-19Merge 6.11-rc4 into usb-nextGreg Kroah-Hartman16-31/+111
We need the usb / thunderbolt fixes in here as well. Signed-off-by: Greg Kroah-Hartman <[email protected]>
2024-08-18nodemask: Switch from inline to __always_inlineYury Norov1-43/+43
'inline' keyword is only a recommendation for compiler. If it decides to not inline nodemask functions, the whole small_const_nbits() machinery doesn't work. This is how a standard GCC 11.3.0 does for my x86_64 build now. This patch replaces 'inline' directive with unconditional '__always_inline' to make sure that there's always a chance for compile-time optimization. It doesn't change size of kernel image, according to bloat-o-meter. [[ Brian: split out from: Subject: [PATCH 1/3] bitmap: switch from inline to __always_inline https://lore.kernel.org/all/[email protected]/ But rewritten, as there were too many conflicts. ]] Co-developed-by: Brian Norris <[email protected]> Signed-off-by: Brian Norris <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Nathan Chancellor <[email protected]> Signed-off-by: Yury Norov <[email protected]>
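The pattern, shown on one representative accessor (the same one-word change is applied across nodemask.h here, and to cpumask.h, bitmap.h and find.h in the adjacent patches):

	/* before: the compiler may decline to inline this, defeating
	 * small_const_nbits() constant folding */
	static inline void __node_set(int node, volatile nodemask_t *dstp)
	{
		set_bit(node, dstp->bits);
	}

	/* after */
	static __always_inline void __node_set(int node, volatile nodemask_t *dstp)
	{
		set_bit(node, dstp->bits);
	}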
2024-08-18cpumask: Switch from inline to __always_inlineBrian Norris1-100/+112
On recent (v6.6+) builds with Clang (based on Clang 18.0.0) and certain configurations [0], I'm finding that (lack of) inlining decisions may lead to section mismatch warnings like the following: WARNING: modpost: vmlinux.o: section mismatch in reference: cpumask_andnot (section: .text) -> cpuhp_bringup_cpus_parallel.tmp_mask (section: .init.data) ERROR: modpost: Section mismatches detected. or more confusingly: WARNING: modpost: vmlinux: section mismatch in reference: cpumask_andnot+0x5f (section: .text) -> efi_systab_phys (section: .init.data) The first warning makes a little sense, because cpuhp_bringup_cpus_parallel() (an __init function) calls cpumask_andnot() on tmp_mask (an __initdata symbol). If the compiler doesn't inline cpumask_andnot(), this may appear like a mismatch. The second warning makes less sense, but might be because efi_systab_phys and cpuhp_bringup_cpus_parallel.tmp_mask are laid out near each other, and the latter isn't a proper C symbol definition. In any case, it seems a reasonable solution to suggest more strongly to the compiler that these cpumask macros *must* be inlined, as 'inline' is just a recommendation. This change has been previously proposed in the past as: Subject: [PATCH 1/3] bitmap: switch from inline to __always_inline https://lore.kernel.org/all/[email protected]/ But the change has been split up, to separately justify the cpumask changes (which drive my work) and the bitmap/const optimizations (that Yury separately proposed for other reasons). This ends up as somewhere between a "rebase" and "rewrite" -- I had to rewrite most of the patch. According to bloat-o-meter, vmlinux decreases minimally in size (-0.00% to -0.01%, depending on the version of GCC or Clang and .config in question) with this series of changes: gcc 13.2.0, x86_64_defconfig -3005 bytes, Before=21944501, After=21941496, chg -0.01% clang 16.0.6, x86_64_defconfig -105 bytes, Before=22571692, After=22571587, chg -0.00% gcc 9.5.0, x86_64_defconfig -1771 bytes, Before=21557598, After=21555827, chg -0.01% clang 18.0_pre516547 (ChromiumOS toolchain), x86_64_defconfig -191 bytes, Before=22615339, After=22615148, chg -0.00% clang 18.0_pre516547 (ChromiumOS toolchain), based on ChromiumOS config + gcov -979 bytes, Before=76294783, After=76293804, chg -0.00% [0] CONFIG_HOTPLUG_PARALLEL=y ('select'ed for x86 as of [1]) and CONFIG_GCOV_PROFILE_ALL. [1] commit 0c7ffa32dbd6 ("x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it") Co-developed-by: Brian Norris <[email protected]> Signed-off-by: Brian Norris <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Nathan Chancellor <[email protected]> Signed-off-by: Yury Norov <[email protected]>
2024-08-18bitmap: Switch from inline to __always_inlineYury Norov1-64/+76
'inline' keyword is only a recommendation for compiler. If it decides to not inline bitmap functions, the whole small_const_nbits() machinery doesn't work. This is how a standard GCC 11.3.0 does for my x86_64 build now. This patch replaces 'inline' directive with unconditional '__always_inline' to make sure that there's always a chance for compile-time optimization. It doesn't change size of kernel image, according to bloat-o-meter. [[ Brian: split out from: Subject: [PATCH 1/3] bitmap: switch from inline to __always_inline https://lore.kernel.org/all/[email protected]/ But rewritten, as there were too many conflicts. ]] Co-developed-by: Brian Norris <[email protected]> Signed-off-by: Brian Norris <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Nathan Chancellor <[email protected]> Signed-off-by: Yury Norov <[email protected]>
2024-08-18find: Switch from inline to __always_inlineYury Norov1-25/+25
'inline' keyword is only a recommendation for compiler. If it decides to not inline find_bit nodemask functions, the whole small_const_nbits() machinery doesn't work. This is how a standard GCC 11.3.0 does for my x86_64 build now. This patch replaces 'inline' directive with unconditional '__always_inline' to make sure that there's always a chance for compile-time optimization. It doesn't change size of kernel image, according to bloat-o-meter. [[ Brian: split out from: Subject: [PATCH 1/3] bitmap: switch from inline to __always_inline https://lore.kernel.org/all/[email protected]/ But rewritten, as there were too many conflicts. ]] Co-developed-by: Brian Norris <[email protected]> Signed-off-by: Brian Norris <[email protected]> Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Nathan Chancellor <[email protected]> Signed-off-by: Yury Norov <[email protected]>
2024-08-17Merge tag 'mm-hotfixes-stable-2024-08-17-19-34' of ↵Linus Torvalds5-19/+62
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "16 hotfixes. All except one are for MM. 10 of these are cc:stable and the others pertain to post-6.10 issues. As usual with these merges, singletons and doubletons all over the place, no identifiable-by-me theme. Please see the lovingly curated changelogs to get the skinny" * tag 'mm-hotfixes-stable-2024-08-17-19-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/migrate: fix deadlock in migrate_pages_batch() on large folios alloc_tag: mark pages reserved during CMA activation as not tagged alloc_tag: introduce clear_page_tag_ref() helper function crash: fix riscv64 crash memory reserve dead loop selftests: memfd_secret: don't build memfd_secret test on unsupported arches mm: fix endless reclaim on machines with unaccepted memory selftests/mm: compaction_test: fix off by one in check_compaction() mm/numa: no task_numa_fault() call if PMD is changed mm/numa: no task_numa_fault() call if PTE is changed mm/vmalloc: fix page mapping if vm_area_alloc_pages() with high order fallback to order 0 mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpu mm: don't account memmap per-node mm: add system wide stats items category mm: don't account memmap on failure mm/hugetlb: fix hugetlb vs. core-mm PT locking mseal: fix is_madv_discard()
2024-08-17Merge tag 'i2c-for-6.11-rc4' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux Pull i2c fixes from Wolfram Sang: "I2C core fix replacing IS_ENABLED() with IS_REACHABLE() For host drivers, there are two fixes: - Tegra I2C Controller: Addresses a potential double-locking issue during probe. ACPI devices are not IRQ-safe when invoking runtime suspend and resume functions, so the irq_safe flag should not be set. - Qualcomm GENI I2C Controller: Fixes an oversight in the exit path of the runtime_resume() function, which was missed in the previous release" * tag 'i2c-for-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: i2c: tegra: Do not mark ACPI devices as irq safe i2c: Use IS_REACHABLE() for substituting empty ACPI functions i2c: qcom-geni: Add missing geni_icc_disable in geni_i2c_runtime_resume
2024-08-17sched/eevdf: Propagate min_slice up the cgroup hierarchyPeter Zijlstra1-0/+1
In the absence of an explicit cgroup slice configuration, make mixed slice length work with cgroups by propagating the min_slice up the hierarchy. This ensures the cgroup entity gets timely service so it can in turn service the entities that have this timing constraint set on them. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Valentin Schneider <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2024-08-17sched/eevdf: Use sched_attr::sched_runtime to set request/slice suggestionPeter Zijlstra1-0/+1
Allow applications to directly set a suggested request/slice length using sched_attr::sched_runtime. The implementation clamps the value to: 0.1[ms] <= slice <= 100[ms] which is 1/10 the size of HZ=1000 and 10 times the size of HZ=100. Applications should strive to use their periodic runtime at a high confidence interval (95%+) as the target slice. Using a smaller slice will introduce undue preemptions, while using a larger value will increase latency. For all the following examples assume a scheduling quantum of 8, and for consistency all examples have W=4: {A,B,C,D}(w=1,r=8): ABCD... +---+---+---+--- t=0, V=1.5 t=1, V=3.5 A |------< A |------< B |------< B |------< C |------< C |------< D |------< D |------< ---+*------+-------+--- ---+--*----+-------+--- t=2, V=5.5 t=3, V=7.5 A |------< A |------< B |------< B |------< C |------< C |------< D |------< D |------< ---+----*--+-------+--- ---+------*+-------+--- Note: 4 identical tasks in FIFO order ~~~ {A,B}(w=1,r=16) C(w=2,r=16) AACCBBCC... +---+---+---+--- t=0, V=1.25 t=2, V=5.25 A |--------------< A |--------------< B |--------------< B |--------------< C |------< C |------< ---+*------+-------+--- ---+----*--+-------+--- t=4, V=8.25 t=6, V=12.25 A |--------------< A |--------------< B |--------------< B |--------------< C |------< C |------< ---+-------*-------+--- ---+-------+---*---+--- Note: 1 heavy task -- because q=8, double r such that the deadline of the w=2 task doesn't go below q. Note: observe the full schedule becomes: W*max(r_i/w_i) = 4*2q = 8q in length. Note: the period of the heavy task is half the full period at: W*(r_i/w_i) = 4*(2q/2) = 4q ~~~ {A,C,D}(w=1,r=16) B(w=1,r=8): BAACCBDD... +---+---+---+--- t=0, V=1.5 t=1, V=3.5 A |--------------< A |---------------< B |------< B |------< C |--------------< C |--------------< D |--------------< D |--------------< ---+*------+-------+--- ---+--*----+-------+--- t=3, V=7.5 t=5, V=11.5 A |---------------< A |---------------< B |------< B |------< C |--------------< C |--------------< D |--------------< D |--------------< ---+------*+-------+--- ---+-------+--*----+--- t=6, V=13.5 A |---------------< B |------< C |--------------< D |--------------< ---+-------+----*--+--- Note: 1 short task -- again double r so that the deadline of the short task won't be below q. Made B short because its not the leftmost task, but is eligible with the 0,1,2,3 spread. Note: like with the heavy task, the period of the short task observes: W*(r_i/w_i) = 4*(1q/1) = 4q ~~~ A(w=1,r=16) B(w=1,r=8) C(w=2,r=16) BCCAABCC... +---+---+---+--- t=0, V=1.25 t=1, V=3.25 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+*------+-------+--- ---+--*----+-------+--- t=3, V=7.25 t=5, V=11.25 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+------*+-------+--- ---+-------+--*----+--- t=6, V=13.25 A |--------------< B |------< C |------< ---+-------+----*--+--- Note: 1 heavy and 1 short task -- combine them all. Note: both the short and heavy task end up with a period of 4q ~~~ A(w=1,r=16) B(w=2,r=16) C(w=1,r=8) BBCAABBC... 
+---+---+---+--- t=0, V=1 t=2, V=5 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+*------+-------+--- ---+----*--+-------+--- t=3, V=7 t=5, V=11 A |--------------< A |--------------< B |------< B |------< C |------< C |------< ---+------*+-------+--- ---+-------+--*----+--- t=7, V=15 A |--------------< B |------< C |------< ---+-------+------*+--- Note: as before but permuted ~~~ From all this it can be deduced that, for the steady state: - the total period (P) of a schedule is: W*max(r_i/w_i) - the average period of a task is: W*(r_i/w_i) - each task obtains the fair share: w_i/W of each full period P Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Tested-by: Valentin Schneider <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
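A userspace sketch of suggesting a slice (sched_setattr() has no glibc wrapper, so a raw syscall is used; the kernel clamps the value to the [0.1ms, 100ms] range described above):

	#define _GNU_SOURCE
	#include <sched.h>		/* SCHED_OTHER */
	#include <linux/sched/types.h>	/* struct sched_attr */
	#include <sys/syscall.h>
	#include <unistd.h>

	/* Suggest a 2ms request/slice for the calling thread. */
	static int suggest_slice(void)
	{
		struct sched_attr attr = {
			.size          = sizeof(attr),
			.sched_policy  = SCHED_OTHER,
			.sched_runtime = 2 * 1000 * 1000,	/* nanoseconds */
		};

		return syscall(SYS_sched_setattr, 0 /* self */, &attr, 0);
	}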
2024-08-17sched/fair: Avoid re-setting virtual deadline on 'migrations'Peter Zijlstra1-2/+4
During OSPM24 Youssef noted that migrations are re-setting the virtual deadline. Notably everything that does a dequeue-enqueue, like setting nice, changing preferred numa-node, and a myriad of other random crap, will cause this to happen. This shouldn't be. Preserve the relative virtual deadline across such dequeue/enqueue cycles. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Tested-by: Valentin Schneider <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2024-08-17sched,freezer: Mark TASK_FROZEN specialPeter Zijlstra1-2/+3
The special task states are those that do not suffer spurious wakeups; TASK_FROZEN is very much one of those, so mark it as such. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Tested-by: Valentin Schneider <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
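A sketch of what "mark it as such" plausibly means here; the exact mask contents are an assumption:

	/* special states are protected against spurious wakeups and must be
	 * set with set_special_state(); TASK_FROZEN now joins the mask */
	#define is_special_task_state(state)				\
		((state) & (__TASK_STOPPED | __TASK_TRACED |		\
			    TASK_PARKED | TASK_DEAD | TASK_FROZEN))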
2024-08-17sched: Prepare generic code for delayed dequeuePeter Zijlstra1-0/+1
While most of the delayed dequeue code can be done inside the sched_class itself, there is one location where we do not have an appropriate hook, namely ttwu_runnable(). Add an ENQUEUE_DELAYED call to the on_rq path to deal with waking delayed dequeue tasks. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Tested-by: Valentin Schneider <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2024-08-17crypto: lib/mpi - Add error checks to extensionHerbert Xu1-11/+11
The remaining functions added by commit a8ea8bdd9df92a0e5db5b43900abb7a288b8a53e did not check for memory allocation errors. Add the checks and change the API to allow errors to be returned. Fixes: a8ea8bdd9df9 ("lib/mpi: Extend the MPI library") Signed-off-by: Herbert Xu <[email protected]>
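Callers of the extension functions now have a status to check; a hedged sketch of the changed shape (previously these returned void):

	int err;

	err = mpi_add(w, u, v);		/* w = u + v */
	if (err)
		return err;		/* typically -ENOMEM */

	err = mpi_mulm(w, u, v, m);	/* w = u * v mod m */
	if (err)
		return err;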
2024-08-17Revert "lib/mpi: Extend the MPI library"Herbert Xu1-65/+0
This partially reverts commit a8ea8bdd9df92a0e5db5b43900abb7a288b8a53e. Most of it is no longer needed since sm2 has been removed. However, the following functions have been kept as they have developed other uses: mpi_copy mpi_mod mpi_test_bit mpi_set_bit mpi_rshift mpi_add mpi_sub mpi_addm mpi_subm mpi_mul mpi_mulm mpi_tdiv_r mpi_fdiv_r Signed-off-by: Herbert Xu <[email protected]>
2024-08-16Merge tag 'thermal-6.11-rc4' of ↵Linus Torvalds1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull thermal control fix from Rafael Wysocki: "Fix a Bang-bang thermal governor issue causing it to fail to reset the state of cooling devices if they are 'on' to start with, but the thermal zone temperature is always below the corresponding trip point (Rafael Wysocki)" * tag 'thermal-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: thermal: gov_bang_bang: Use governor_data to reduce overhead thermal: gov_bang_bang: Add .manage() callback thermal: gov_bang_bang: Split bang_bang_control() thermal: gov_bang_bang: Call __thermal_cdev_update() directly
2024-08-16Merge branch '40GbE' of ↵Jakub Kicinski1-1/+12
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== ice: iavf: add support for TC U32 filters on VFs Ahmed Zaki says: The Intel Ethernet 800 Series is designed with a pipeline that has an on-chip programmable capability called Dynamic Device Personalization (DDP). A DDP package is loaded by the driver during probe time. The DDP package programs functionality in both the parser and switching blocks in the pipeline, allowing dynamic support for new and existing protocols. Once the pipeline is configured, the driver can identify the protocol and apply any HW action in different stages, for example, direct packets to desired hardware queues (flow director), queue groups or drop. Patches 1-8 introduce a DDP package parser API that enables different pipeline stages in the driver to learn the HW parser capabilities from the DDP package that is downloaded to HW. The parser library takes raw packet patterns and masks (in binary) indicating the packet protocol fields to be matched and generates the final HW profiles that can be applied at the required stage. With this API, raw flow filtering for FDIR or RSS could be done on new protocols or headers without any driver or Kernel updates (only need to update the DDP package). These patches were submitted before [1] but were not accepted mainly due to lack of a user. Patches 9-11 extend the virtchnl support to allow the VF to request raw flow director filters. Upon receiving the raw FDIR filter request, the PF driver allocates and runs a parser lib instance and generates the hardware profile definitions required to program the FDIR stage. These were also submitted before [2]. Finally, patches 12 and 13 add TC U32 filter support to the iavf driver. Using the parser API, the ice driver runs the raw patterns sent by the user and then adds a new profile to the FDIR stage associated with the VF's VSI. Refer to examples in patch 13 commit message. [1]: https://lore.kernel.org/netdev/[email protected]/ [2]: https://lore.kernel.org/intel-wired-lan/[email protected]/ * '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: iavf: add support for offloading tc U32 cls filters iavf: refactor add/del FDIR filters ice: enable FDIR filters from raw binary patterns for VFs ice: add method to disable FDIR SWAP option virtchnl: support raw packet in protocol header ice: add API for parser profile initialization ice: add UDP tunnels support to the parser ice: support turning on/off the parser's double vlan mode ice: add parser execution main loop ice: add parser internal helper functions ice: add debugging functions for the parser sections ice: parse and init various DDP parser sections ice: add parser create and destroy skeleton ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2024-08-16Merge tag 'iommu-fixes-v6.11-rc3' of ↵Linus Torvalds1-2/+0
git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux Pull iommu fixes from Joerg Roedel: - Bring back a lost return statement in io-page-fault code - Remove an unused function declaration * tag 'iommu-fixes-v6.11-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: iommu: Remove unused declaration iommu_sva_unbind_gpasid() iommu: Restore lost return in iommu_report_device_fault()
2024-08-16Merge tag 'sound-6.11-rc4' of ↵Linus Torvalds1-1/+18
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound fixes from Takashi Iwai: "All small fixes, mostly for usual suspects, HD-audio and USB-audio device-specific fixes / quirks. The Cirrus codec support took the update of SPI header as well. Other than that, there is a regression fix in the sanity check of ALSA timer code" * tag 'sound-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: ALSA: hda/tas2781: Use correct endian conversion ALSA: usb-audio: Support Yamaha P-125 quirk entry ALSA: hda: cs35l41: Remove redundant call to hda_cs_dsp_control_remove() ALSA: hda: cs35l56: Remove redundant call to hda_cs_dsp_control_remove() ALSA: hda/tas2781: fix wrong calibrated data order ALSA: usb-audio: Add delay quirk for VIVO USB-C-XE710 HEADSET ALSA: hda/realtek: Add support for new HP G12 laptops ALSA: hda/realtek: Fix noise from speakers on Lenovo IdeaPad 3 15IAU7 ALSA: timer: Relax start tick time check for slave timer elements spi: Add empty versions of ACPI functions
2024-08-16perf: arm_pmuv3: Add support for Armv9.4 PMU instruction counterRob Herring (Arm)2-4/+10
Armv9.4/8.9 PMU adds optional support for a fixed instruction counter similar to the fixed cycle counter. Support for the feature is indicated in the ID_AA64DFR1_EL1 register PMICNTR field. The counter is not accessible in AArch32. Existing userspace using direct counter access won't know how to handle the fixed instruction counter, so we have to avoid using the counter when user access is requested. Acked-by: Mark Rutland <[email protected]> Signed-off-by: Rob Herring (Arm) <[email protected]> Tested-by: James Clark <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]>
2024-08-16KVM: arm64: Refine PMU defines for number of countersRob Herring (Arm)1-2/+0
There are two defines for the number of PMU counters: ARMV8_PMU_MAX_COUNTERS and ARMPMU_MAX_HWEVENTS. Both are currently the same, but Armv9.4/8.9 increases the number of possible counters from 32 to 33. With this change, the maximum number of counters will differ for KVM's PMU emulation, which is PMUv3.4. Give KVM PMU emulation its own define to decouple it from the rest of the kernel's number of PMU counters. The VHE PMU code needs to match the PMU driver, so switch it to use ARMPMU_MAX_HWEVENTS instead. Acked-by: Mark Rutland <[email protected]> Reviewed-by: Marc Zyngier <[email protected]> Signed-off-by: Rob Herring (Arm) <[email protected]> Tested-by: James Clark <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]>
2024-08-16arm64: perf/kvm: Use a common PMU cycle counter defineRob Herring (Arm)1-0/+3
The PMUv3 and KVM code each have a define for the PMU cycle counter index. Move KVM's define to a shared location and use it for PMUv3 driver. Reviewed-by: Marc Zyngier <[email protected]> Acked-by: Mark Rutland <[email protected]> Signed-off-by: Rob Herring (Arm) <[email protected]> Tested-by: James Clark <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]>
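The shared definition, as described (location per the commit; exact spelling assumed):

	/* include/linux/perf/arm_pmuv3.h: one cycle-counter index for both
	 * the PMUv3 driver and KVM, replacing two private copies */
	#define ARMV8_PMU_CYCLE_IDX	31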
2024-08-16KVM: arm64: pmu: Use generated define for PMSELR_EL0.SEL accessRob Herring (Arm)1-1/+0
ARMV8_PMU_COUNTER_MASK is really a mask for the PMSELR_EL0.SEL register field. Make that clear by adding a standard sysreg definition for the register, and using it instead. Reviewed-by: Mark Rutland <[email protected]> Acked-by: Mark Rutland <[email protected]> Reviewed-by: Marc Zyngier <[email protected]> Signed-off-by: Rob Herring (Arm) <[email protected]> Tested-by: James Clark <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]>
2024-08-16perf: arm_pmu: Remove event index to counter remappingRob Herring (Arm)2-1/+2
Xscale and Armv6 PMUs defined the cycle counter at 0 and event counters starting at 1 and had 1:1 event index to counter numbering. On Armv7 and later, this changed the cycle counter to 31 and event counters start at 0. The drivers for Armv7 and PMUv3 kept the old event index numbering and introduced an event index to counter conversion. The conversion uses masking to convert from event index to a counter number. This operation relies on having at most 32 counters so that the cycle counter index 0 can be transformed to counter number 31. Armv9.4 adds support for an additional fixed function counter (instructions) which increases possible counters to more than 32, and the conversion won't work anymore as a simple subtract and mask. The primary reason for the translation (other than history) seems to be to have a contiguous mask of counters 0-N. Keeping that would result in more complicated index to counter conversions. Instead, store a mask of available counters rather than just number of events. That provides more information in addition to the number of events. No (intended) functional changes. Acked-by: Mark Rutland <[email protected]> Signed-off-by: Rob Herring (Arm) <[email protected]> Tested-by: James Clark <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]>
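A sketch of the direction (field and iteration shown with assumed names):

	/* struct arm_pmu: a bitmap of implemented counters replaces the
	 * plain num_events count, so indices need no remapping */
	DECLARE_BITMAP(cntr_mask, ARMPMU_MAX_HWEVENTS);

	/* drivers then walk hardware counter numbers directly: */
	for_each_set_bit(idx, cpu_pmu->cntr_mask, ARMPMU_MAX_HWEVENTS) {
		/* ... program/enable counter 'idx' ... */
	}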
2024-08-16thermal: gov_bang_bang: Use governor_data to reduce overheadRafael J. Wysocki1-0/+1
After running once, the for_each_trip_desc() loop in bang_bang_manage() is pure needless overhead because it is not going to make any changes unless a new cooling device has been bound to one of the trips in the thermal zone or the system is resuming from sleep. For this reason, make bang_bang_manage() set governor_data for the thermal zone and check it upfront to decide whether or not it needs to do anything. However, governor_data needs to be reset in some cases to let bang_bang_manage() know that it should walk the trips again, so add an .update_tz() callback to the governor and make the core additionally invoke it during system resume. To avoid affecting the other users of that callback unnecessarily, add a special notification reason for system resume, THERMAL_TZ_RESUME, and also pass it to __thermal_zone_device_update() called during system resume for consistency. Signed-off-by: Rafael J. Wysocki <[email protected]> Acked-by: Peter Kästle <[email protected]> Reviewed-by: Zhang Rui <[email protected]> Cc: 6.10+ <[email protected]> # 6.10+ Link: https://patch.msgid.link/[email protected]
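A sketch of the upfront check described above (a hypothetical simplification of the real governor):

	static void bang_bang_manage(struct thermal_zone_device *tz)
	{
		const struct thermal_trip_desc *td;

		/* Initialized already; nothing to do until .update_tz() resets
		 * us (new cooling device bound, or THERMAL_TZ_RESUME). */
		if (tz->governor_data)
			return;

		for_each_trip_desc(tz, td) {
			/* ... set initial cooling-device states for this trip ... */
		}

		tz->governor_data = (void *)true;
	}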
2024-08-16string: add mem_is_zero() helper to check if memory area is all zerosJani Nikula1-0/+12
Almost two thirds of the memchr_inv() usages check if the memory area is all zeros, with no interest in where in the buffer the first non-zero byte is located. Checking for !memchr_inv(s, 0, n) is also not very intuitive or discoverable. Add an explicit mem_is_zero() helper for this use case. Reviewed-by: Kees Cook <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] Signed-off-by: Jani Nikula <[email protected]>
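The helper is a thin, discoverable wrapper over memchr_inv(); a sketch:

	/**
	 * mem_is_zero - Check if an area of memory is all zeros.
	 * @s: memory area
	 * @n: size of the area
	 */
	static inline bool mem_is_zero(const void *s, size_t n)
	{
		return !memchr_inv(s, 0, n);
	}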
2024-08-16net: mscc: ocelot: use ocelot_xmit_get_vlan_info() also for FDMA and ↵Vladimir Oltean1-0/+47
register injection Problem description ------------------- On an NXP LS1028A (felix DSA driver) with the following configuration: - ocelot-8021q tagging protocol - VLAN-aware bridge (with STP) spanning at least swp0 and swp1 - 8021q VLAN upper interfaces on swp0 and swp1: swp0.700, swp1.700 - ptp4l on swp0.700 and swp1.700 we see that the ptp4l instances do not see each other's traffic, and they all go to the grand master state due to the ANNOUNCE_RECEIPT_TIMEOUT_EXPIRES condition. Jumping to the conclusion for the impatient ------------------------------------------- There is a zero-day bug in the ocelot switchdev driver in the way it handles VLAN-tagged packet injection. The correct logic already exists in the source code, in function ocelot_xmit_get_vlan_info() added by commit 5ca721c54d86 ("net: dsa: tag_ocelot: set the classified VLAN during xmit"). But it is used only for normal NPI-based injection with the DSA "ocelot" tagging protocol. The other injection code paths (register-based and FDMA-based) roll their own wrong logic. This affects and was noticed on the DSA "ocelot-8021q" protocol because it uses register-based injection. By moving ocelot_xmit_get_vlan_info() to a place that's common for both the DSA tagger and the ocelot switch library, it can also be called from ocelot_port_inject_frame() in ocelot.c. We need to touch the lines with ocelot_ifh_port_set()'s prototype anyway, so let's rename it to something clearer regarding what it does, and add a kernel-doc. ocelot_ifh_set_basic() should do. Investigation notes ------------------- Debugging reveals that PTP event (aka those carrying timestamps, like Sync) frames injected into swp0.700 (but also swp1.700) hit the wire with two VLAN tags: 00000000: 01 1b 19 00 00 00 00 01 02 03 04 05 81 00 02 bc ~~~~~~~~~~~ 00000010: 81 00 02 bc 88 f7 00 12 00 2c 00 00 02 00 00 00 ~~~~~~~~~~~ 00000020: 00 00 00 00 00 00 00 00 00 00 00 01 02 ff fe 03 00000030: 04 05 00 01 00 04 00 00 00 00 00 00 00 00 00 00 00000040: 00 00 The second (unexpected) VLAN tag makes felix_check_xtr_pkt() -> ptp_classify_raw() fail to see these as PTP packets at the link partner's receiving end, and return PTP_CLASS_NONE (because the BPF classifier is not written to expect 2 VLAN tags). The reason why packets have 2 VLAN tags is because the transmission code treats VLAN incorrectly. Neither ocelot switchdev, nor felix DSA, declare the NETIF_F_HW_VLAN_CTAG_TX feature. Therefore, at xmit time, all VLANs should be in the skb head, and none should be in the hwaccel area. This is done by: static struct sk_buff *validate_xmit_vlan(struct sk_buff *skb, netdev_features_t features) { if (skb_vlan_tag_present(skb) && !vlan_hw_offload_capable(features, skb->vlan_proto)) skb = __vlan_hwaccel_push_inside(skb); return skb; } But ocelot_port_inject_frame() handles things incorrectly: ocelot_ifh_port_set(ifh, port, rew_op, skb_vlan_tag_get(skb)); void ocelot_ifh_port_set(struct sk_buff *skb, void *ifh, int port, u32 rew_op) { (...) if (vlan_tag) ocelot_ifh_set_vlan_tci(ifh, vlan_tag); (...) } The way __vlan_hwaccel_push_inside() pushes the tag inside the skb head is by calling: static inline void __vlan_hwaccel_clear_tag(struct sk_buff *skb) { skb->vlan_present = 0; } which does _not_ zero out skb->vlan_tci as seen by skb_vlan_tag_get(). This means that ocelot, when it calls skb_vlan_tag_get(), sees (and uses) a residual skb->vlan_tci, while the same VLAN tag is _already_ in the skb head. 
The trivial fix for double VLAN headers is to replace the content of ocelot_ifh_port_set() with: if (skb_vlan_tag_present(skb)) ocelot_ifh_set_vlan_tci(ifh, skb_vlan_tag_get(skb)); but this would not be correct either, because, as mentioned, vlan_hw_offload_capable() is false for us, so we'd be inserting dead code and we'd always transmit packets with VID=0 in the injection frame header. I can't actually test the ocelot switchdev driver and rely exclusively on code inspection, but I don't think traffic from 8021q uppers has ever been injected properly, and not double-tagged. Thus I'm blaming the introduction of VLAN fields in the injection header - early driver code. As hinted at in the early conclusion, what we _want_ to happen for VLAN transmission was already described once in commit 5ca721c54d86 ("net: dsa: tag_ocelot: set the classified VLAN during xmit"). ocelot_xmit_get_vlan_info() intends to ensure that if the port through which we're transmitting is under a VLAN-aware bridge, the outer VLAN tag from the skb head is stripped from there and inserted into the injection frame header (so that the packet is processed in hardware through that actual VLAN). And in all other cases, the packet is sent with VID=0 in the injection frame header, since the port is VLAN-unaware and has logic to strip this VID on egress (making it invisible to the wire). Fixes: 08d02364b12f ("net: mscc: fix the injection header") Signed-off-by: Vladimir Oltean <[email protected]> Signed-off-by: David S. Miller <[email protected]>
2024-08-15alloc_tag: introduce clear_page_tag_ref() helper functionSuren Baghdasaryan1-0/+13
In several cases we are freeing pages which were not allocated using common page allocators. For such cases, in order to keep allocation accounting correct, we should clear the page tag to indicate that the page being freed is expected to not have a valid allocation tag. Introduce clear_page_tag_ref() helper function to be used for this. Link: https://lkml.kernel.org/r/[email protected] Fixes: d224eb0287fb ("codetag: debug: mark codetags for reserved pages as empty") Signed-off-by: Suren Baghdasaryan <[email protected]> Suggested-by: David Hildenbrand <[email protected]> Acked-by: David Hildenbrand <[email protected]> Reviewed-by: Pasha Tatashin <[email protected]> Cc: Kees Cook <[email protected]> Cc: Kent Overstreet <[email protected]> Cc: Sourav Panda <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: <[email protected]> [6.10] Signed-off-by: Andrew Morton <[email protected]>
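A sketch of the helper's likely shape, assuming the get/put_page_tag_ref() accessors used elsewhere in the alloc_tag code:

	static inline void clear_page_tag_ref(struct page *page)
	{
		if (mem_alloc_profiling_enabled()) {
			union codetag_ref *ref = get_page_tag_ref(page);

			if (ref) {
				/* mark as deliberately untagged, not leaked */
				set_codetag_empty(ref);
				put_page_tag_ref(ref);
			}
		}
	}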
2024-08-15mm: don't account memmap per-nodePasha Tatashin2-5/+4
Fix invalid access to pgdat during hot-remove operation: ndctl users reported a GPF when trying to destroy a namespace:

	$ ndctl destroy-namespace all -r all -f
	Segmentation fault
	dmesg:
	Oops: general protection fault, probably for non-canonical address 0xdffffc0000005650: 0000 [#1] PREEMPT SMP KASAN PTI
	KASAN: probably user-memory-access in range [0x000000000002b280-0x000000000002b287]
	CPU: 26 UID: 0 PID: 1868 Comm: ndctl Not tainted 6.11.0-rc1 #1
	Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS 2.20.1 09/13/2023
	RIP: 0010:mod_node_page_state+0x2a/0x110

cxl-test users report a GPF when trying to unload the test module:

	$ modprobe -r cxl-test
	dmesg:
	BUG: unable to handle page fault for address: 0000000000004200
	#PF: supervisor read access in kernel mode
	#PF: error_code(0x0000) - not-present page
	PGD 0 P4D 0
	Oops: Oops: 0000 [#1] PREEMPT SMP PTI
	CPU: 0 UID: 0 PID: 1076 Comm: modprobe Tainted: G O N 6.11.0-rc1 #197
	Tainted: [O]=OOT_MODULE, [N]=TEST
	Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/15
	RIP: 0010:mod_node_page_state+0x6/0x90

Currently, when memory is hot-plugged or hot-removed, the accounting is done based on the assumption that memmap is allocated from the same node as the hot-plugged/hot-removed memory, which is not always the case. In addition, there are challenges with keeping the node id of the memory that is being removed up to the time when memmap accounting is actually performed: since this is done after remove_pfn_range_from_zone(), and also after remove_memory_block_devices(), we can use neither pgdat nor a walk through memblocks to get the nid. Given all of that, account the memmap overhead system wide instead. For this we are going to be using global atomic counters, but given that memmap size is rarely modified, and normally is only modified either during early boot when there is only one CPU, or under a hotplug global mutex lock, there is no need for per-cpu optimizations. Also, while we are here, rename nr_memmap to nr_memmap_pages, and nr_memmap_boot to nr_memmap_boot_pages, to make it self-explanatory that the units are in page count. [[email protected]: address a few nits from David Hildenbrand] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 15995a352474 ("mm: report per-page metadata information") Signed-off-by: Pasha Tatashin <[email protected]> Reported-by: Yi Zhang <[email protected]> Closes: https://lore.kernel.org/linux-cxl/CAHj4cs9Ax1=CoJkgBGP_+sNu6-6=6v=_L-ZBZY0bVLD3wUWZQg@mail.gmail.com Reported-by: Alison Schofield <[email protected]> Closes: https://lore.kernel.org/linux-mm/Zq0tPd2h6alFz8XF@aschofie-mobl2/#t Tested-by: Dan Williams <[email protected]> Tested-by: Alison Schofield <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: David Rientjes <[email protected]> Tested-by: Yi Zhang <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Fan Ni <[email protected]> Cc: Joel Granados <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Li Zhijian <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Muchun Song <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Sourav Panda <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yosry Ahmed <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-08-15mm: add system wide stats items categoryPasha Tatashin1-11/+4
/proc/vmstat contains events and stats; events can only grow, but stats can grow and shrink. vmstat has the following categories:

	NR_VM_ZONE_STAT_ITEMS:      per-zone stats
	NR_VM_NUMA_EVENT_ITEMS:     per-numa events
	NR_VM_NODE_STAT_ITEMS:      per-numa stats
	NR_VM_WRITEBACK_STAT_ITEMS: system-wide background-writeback and dirty-throttling thresholds
	NR_VM_EVENT_ITEMS:          system-wide events

Rename NR_VM_WRITEBACK_STAT_ITEMS to NR_VM_STAT_ITEMS to track the system-wide stats; we are going to add per-page metadata stats to this category in the next patch. Also delete the unused writeback_stat_name(). Link: https://lkml.kernel.org/r/[email protected] Link: https://lkml.kernel.org/r/[email protected] Fixes: 15995a352474 ("mm: report per-page metadata information") Signed-off-by: Pasha Tatashin <[email protected]> Suggested-by: Yosry Ahmed <[email protected]> Tested-by: Alison Schofield <[email protected]> Acked-by: David Hildenbrand <[email protected]> Acked-by: David Rientjes <[email protected]> Cc: Dan Williams <[email protected]> Cc: Domenico Cerasuolo <[email protected]> Cc: Joel Granados <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Li Zhijian <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mike Rapoport <[email protected]> Cc: Muchun Song <[email protected]> Cc: Nhat Pham <[email protected]> Cc: Sourav Panda <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: Yi Zhang <[email protected]> Cc: Fan Ni <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2024-08-15mm/hugetlb: fix hugetlb vs. core-mm PT lockingDavid Hildenbrand2-3/+41
We recently made GUP's common page table walking code to also walk hugetlb VMAs without most hugetlb special-casing, preparing for the future of having less hugetlb-specific page table walking code in the codebase. Turns out that we missed one page table locking detail: page table locking for hugetlb folios that are not mapped using a single PMD/PUD. Assume we have hugetlb folio that spans multiple PTEs (e.g., 64 KiB hugetlb folios on arm64 with 4 KiB base page size). GUP, as it walks the page tables, will perform a pte_offset_map_lock() to grab the PTE table lock. However, hugetlb that concurrently modifies these page tables would actually grab the mm->page_table_lock: with USE_SPLIT_PTE_PTLOCKS, the locks would differ. Something similar can happen right now with hugetlb folios that span multiple PMDs when USE_SPLIT_PMD_PTLOCKS. This issue can be reproduced [1], for example triggering: [ 3105.936100] ------------[ cut here ]------------ [ 3105.939323] WARNING: CPU: 31 PID: 2732 at mm/gup.c:142 try_grab_folio+0x11c/0x188 [ 3105.944634] Modules linked in: [...] [ 3105.974841] CPU: 31 PID: 2732 Comm: reproducer Not tainted 6.10.0-64.eln141.aarch64 #1 [ 3105.980406] Hardware name: QEMU KVM Virtual Machine, BIOS edk2-20240524-4.fc40 05/24/2024 [ 3105.986185] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 3105.991108] pc : try_grab_folio+0x11c/0x188 [ 3105.994013] lr : follow_page_pte+0xd8/0x430 [ 3105.996986] sp : ffff80008eafb8f0 [ 3105.999346] x29: ffff80008eafb900 x28: ffffffe8d481f380 x27: 00f80001207cff43 [ 3106.004414] x26: 0000000000000001 x25: 0000000000000000 x24: ffff80008eafba48 [ 3106.009520] x23: 0000ffff9372f000 x22: ffff7a54459e2000 x21: ffff7a546c1aa978 [ 3106.014529] x20: ffffffe8d481f3c0 x19: 0000000000610041 x18: 0000000000000001 [ 3106.019506] x17: 0000000000000001 x16: ffffffffffffffff x15: 0000000000000000 [ 3106.024494] x14: ffffb85477fdfe08 x13: 0000ffff9372ffff x12: 0000000000000000 [ 3106.029469] x11: 1fffef4a88a96be1 x10: ffff7a54454b5f0c x9 : ffffb854771b12f0 [ 3106.034324] x8 : 0008000000000000 x7 : ffff7a546c1aa980 x6 : 0008000000000080 [ 3106.038902] x5 : 00000000001207cf x4 : 0000ffff9372f000 x3 : ffffffe8d481f000 [ 3106.043420] x2 : 0000000000610041 x1 : 0000000000000001 x0 : 0000000000000000 [ 3106.047957] Call trace: [ 3106.049522] try_grab_folio+0x11c/0x188 [ 3106.051996] follow_pmd_mask.constprop.0.isra.0+0x150/0x2e0 [ 3106.055527] follow_page_mask+0x1a0/0x2b8 [ 3106.058118] __get_user_pages+0xf0/0x348 [ 3106.060647] faultin_page_range+0xb0/0x360 [ 3106.063651] do_madvise+0x340/0x598 Let's make huge_pte_lockptr() effectively use the same PT locks as any core-mm page table walker would. Add ptep_lockptr() to obtain the PTE page table lock using a pte pointer -- unfortunately we cannot convert pte_lockptr() because virt_to_page() doesn't work with kmap'ed page tables we can have with CONFIG_HIGHPTE. Handle CONFIG_PGTABLE_LEVELS correctly by checking in reverse order, such that when e.g., CONFIG_PGTABLE_LEVELS==2 with PGDIR_SIZE==P4D_SIZE==PUD_SIZE==PMD_SIZE will work as expected. Document why that works. There is one ugly case: powerpc 8xx, whereby we have an 8 MiB hugetlb folio being mapped using two PTE page tables. While hugetlb wants to take the PMD table lock, core-mm would grab the PTE table lock of one of both PTE page tables. In such corner cases, we have to make sure that both locks match, which is (fortunately!) currently guaranteed for 8xx as it does not support SMP and consequently doesn't use split PT locks. 
[1] https://lore.kernel.org/all/[email protected]/ Link: https://lkml.kernel.org/r/[email protected] Fixes: 9cb28da54643 ("mm/gup: handle hugetlb in the generic follow_page_mask code") Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Peter Xu <[email protected]> Reviewed-by: Baolin Wang <[email protected]> Tested-by: Baolin Wang <[email protected]> Cc: Peter Xu <[email protected]> Cc: Oscar Salvador <[email protected]> Cc: Muchun Song <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>