aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)AuthorFilesLines
2023-08-22cpufreq: amd-pstate-ut: Fix kernel panic when loading the driverSwapnil Sapkal1-8/+16
After loading the amd-pstate-ut driver, amd_pstate_ut_check_perf() and amd_pstate_ut_check_freq() use cpufreq_cpu_get() to get the policy of the CPU and mark it as busy. In these functions, cpufreq_cpu_put() should be used to release the policy, but it is not, so any other entity trying to access the policy is blocked indefinitely. One such scenario is when amd_pstate mode is changed, leading to the following splat: [ 1332.103727] INFO: task bash:2929 blocked for more than 120 seconds. [ 1332.110001] Not tainted 6.5.0-rc2-amd-pstate-ut #5 [ 1332.115315] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1332.123140] task:bash state:D stack:0 pid:2929 ppid:2873 flags:0x00004006 [ 1332.123143] Call Trace: [ 1332.123145] <TASK> [ 1332.123148] __schedule+0x3c1/0x16a0 [ 1332.123154] ? _raw_read_lock_irqsave+0x2d/0x70 [ 1332.123157] schedule+0x6f/0x110 [ 1332.123160] schedule_timeout+0x14f/0x160 [ 1332.123162] ? preempt_count_add+0x86/0xd0 [ 1332.123165] __wait_for_common+0x92/0x190 [ 1332.123168] ? __pfx_schedule_timeout+0x10/0x10 [ 1332.123170] wait_for_completion+0x28/0x30 [ 1332.123173] cpufreq_policy_put_kobj+0x4d/0x90 [ 1332.123177] cpufreq_policy_free+0x157/0x1d0 [ 1332.123178] ? preempt_count_add+0x58/0xd0 [ 1332.123180] cpufreq_remove_dev+0xb6/0x100 [ 1332.123182] subsys_interface_unregister+0x114/0x120 [ 1332.123185] ? preempt_count_add+0x58/0xd0 [ 1332.123187] ? __pfx_amd_pstate_change_driver_mode+0x10/0x10 [ 1332.123190] cpufreq_unregister_driver+0x3b/0xd0 [ 1332.123192] amd_pstate_change_driver_mode+0x1e/0x50 [ 1332.123194] store_status+0xe9/0x180 [ 1332.123197] dev_attr_store+0x1b/0x30 [ 1332.123199] sysfs_kf_write+0x42/0x50 [ 1332.123202] kernfs_fop_write_iter+0x143/0x1d0 [ 1332.123204] vfs_write+0x2df/0x400 [ 1332.123208] ksys_write+0x6b/0xf0 [ 1332.123210] __x64_sys_write+0x1d/0x30 [ 1332.123213] do_syscall_64+0x60/0x90 [ 1332.123216] ? fpregs_assert_state_consistent+0x2e/0x50 [ 1332.123219] ? exit_to_user_mode_prepare+0x49/0x1a0 [ 1332.123223] ? irqentry_exit_to_user_mode+0xd/0x20 [ 1332.123225] ? irqentry_exit+0x3f/0x50 [ 1332.123226] ? exc_page_fault+0x8e/0x190 [ 1332.123228] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 [ 1332.123232] RIP: 0033:0x7fa74c514a37 [ 1332.123234] RSP: 002b:00007ffe31dd0788 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 1332.123238] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007fa74c514a37 [ 1332.123239] RDX: 0000000000000008 RSI: 000055e27c447aa0 RDI: 0000000000000001 [ 1332.123241] RBP: 000055e27c447aa0 R08: 00007fa74c5d1460 R09: 000000007fffffff [ 1332.123242] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000008 [ 1332.123244] R13: 00007fa74c61a780 R14: 00007fa74c616600 R15: 00007fa74c615a00 [ 1332.123247] </TASK> Fix this by calling cpufreq_cpu_put() wherever necessary. Fixes: 14eb1c96e3a3 ("cpufreq: amd-pstate: Add test module for amd-pstate driver") Reviewed-by: Mario Limonciello <[email protected]> Reviewed-by: Meng Li <[email protected]> Reviewed-by: Wyes Karny <[email protected]> Suggested-by: Wyes Karny <[email protected]> Signed-off-by: Swapnil Sapkal <[email protected]> [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-08-22cpufreq: amd-pstate-ut: Remove module parameter accessSwapnil Sapkal1-20/+2
In amd-pstate-ut, shared memory-based systems call get_shared_mem() as part of amd_pstate_ut_check_enabled() function. This function was written when CONFIG_X86_AMD_PSTATE was tristate config and amd_pstate can be built as a module. Currently CONFIG_X86_AMD_PSTATE is a boolean config and module parameter shared_mem is removed. But amd-pstate-ut code still accesses this module parameter. Remove those accesses. Fixes: 456ca88d8a52 ("cpufreq: amd-pstate: change amd-pstate driver to be built-in type") Reviewed-by: Mario Limonciello <[email protected]> Reviewed-by: Meng Li <[email protected]> Reviewed-by: Wyes Karny <[email protected]> Suggested-by: Wyes Karny <[email protected]> Signed-off-by: Swapnil Sapkal <[email protected]> [ rjw: Subject edits ] Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-08-22cpufreq: Use clamp() helper macro to improve the code readabilityLiao Chang1-11/+3
The valid values of policy.{min, max} should be between 'min' and 'max', so use clamp() helper macro to makes cpufreq_verify_within_limits() easier to follow. Signed-off-by: Liao Chang <[email protected]> [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-08-22thermal: intel: intel_soc_dts_iosf: Remove redundant checkZhang Rui1-5/+3
Remove the redundant check in remove_dts_thermal_zone() because all of its existing callers pass a valid pointer as the argument. Signed-off-by: Zhang Rui <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-08-22PM: sleep: Add helpers to allow a device to remain powered-onUlf Hansson1-0/+10
On some platforms a device and its corresponding PM domain, may need to remain powered-on during system wide suspend, to support various use cases. For example, when the console_suspend_enabled flag is unset for a serial controller, the corresponding device may need to remain powered on. Other use cases exists too. In fact, we already have the mechanism in the PM core to deal with these kind of use cases. However, the current naming of the corresponding functions/flags clearly suggests these should be use for system wakeup. See device_wakeup_path(), device_set_wakeup_path and dev->power.wakeup_path. As a way to extend the use of the existing mechanism, let's introduce two new helpers functions, device_awake_path() and device_set_awake_path(). At this point, let them act as wrappers of the existing functions. Ideally, when all users have been converted to use the new helpers, we may decide to drop the old ones and rename the flag. Signed-off-by: Ulf Hansson <[email protected]> Reviewed-by: Peng Fan <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-08-22thermal: intel: int340x: simplify the code with module_platform_driver()Yang Yingliang1-12/+1
The init/exit() of the driver only calls platform_driver_{un}register(), so it can be simpilfied by using module_platform_driver(). Signed-off-by: Yang Yingliang <[email protected]> [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-08-22PM: QoS: Add check to make sure CPU latency is non-negativeClive Lin1-2/+7
CPU latency should never be negative, which will be incorrectly high when converted to unsigned data type. Commit 8d36694245f2 ("PM: QoS: Add check to make sure CPU freq is non-negative") makes sure CPU frequency is non-negative to fix incorrect behavior in freqency QoS. Add an analogous check to make sure CPU latency is non-negative so as to prevent this problem from happening in CPU latency QoS. Signed-off-by: Clive Lin <[email protected]> [ rjw: Changelog edits ] Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-08-22PM: runtime: Remove unsued extern declaration of ↵YueHaibing1-2/+0
pm_runtime_update_max_time_suspended() Commit 76e267d822f2 ("PM / Runtime: Remove device fields related to suspend time, v2") left behind this. Signed-off-by: YueHaibing <[email protected]> Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-08-22thermal/of: Fix potential uninitialized value accessPeng Fan1-4/+4
If of_parse_phandle_with_args() called from __thermal_of_bind() or __thermal_of_unbind() fails, cooling_spec.np will not be initialized, so move the of_node_put() calls below the respective return value checks to avoid dereferencing an uninitialized pointer. Fixes: 3fd6d6e2b4e8 ("thermal/of: Rework the thermal device tree initialization") Signed-off-by: Peng Fan <[email protected]> [ rjw: Subject and changelog edits ] Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-08-22Merge tag 'devicetree-fixes-for-6.5-2' of ↵Linus Torvalds4-27/+15
git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux Pull devicetree fixes from Rob Herring: - Fix DT node refcount when creating platform devices - Fix deadlock in changeset code due to printing with devtree_lock held - Fix unittest EXPECT strings for parse_phandle_with_args_map() test - Fix IMA kexec memblock freeing * tag 'devicetree-fixes-for-6.5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux: of/platform: increase refcount of fwnode of: dynamic: Refactor action prints to not use "%pOF" inside devtree_lock of: unittest: Fix EXPECT for parse_phandle_with_args_map() test mm,ima,kexec,of: use memblock_free_late from ima_free_kexec_buffer
2023-08-22sfc: allocate a big enough SKB for loopback selftest packetEdward Cree3-3/+3
Cited commits passed a size to alloc_skb that was only big enough for the actual packet contents, but the following skb_put + memcpy writes the whole struct efx_loopback_payload including leading and trailing padding bytes (which are then stripped off with skb_pull/skb_trim). This could cause an skb_over_panic, although in practice we get saved by kmalloc_size_roundup. Pass the entire size we use, instead of the size of the final packet. Reported-by: Andy Moreton <[email protected]> Fixes: cf60ed469629 ("sfc: use padding to fix alignment in loopback test") Fixes: 30c24dd87f3f ("sfc: siena: use padding to fix alignment in loopback test") Fixes: 1186c6b31ee1 ("sfc: falcon: use padding to fix alignment in loopback test") Signed-off-by: Edward Cree <[email protected]> Reviewed-by: Simon Horman <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-22Merge tag 'wireless-2023-08-22' of ↵Jakub Kicinski3-2/+12
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Two fixes: - reorder buffer filter checks can cause bad shift/UBSAN warning with newer HW, avoid the check (mac80211) - add Kconfig dependency for iwlwifi for PTP clock usage * tag 'wireless-2023-08-22' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: mac80211: limit reorder_buf_filtered to avoid UBSAN warning wifi: iwlwifi: mvm: add dependency for PTP clock ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-22leds: trigger: netdev: rename 'hw_control' sysfs entry to 'offloaded'Marek Behún2-14/+14
Commit b655892ffd6d ("leds: trigger: netdev: expose hw_control status via sysfs") exposed to sysfs the flag that tells whether the LED trigger is offloaded to hardware, under the name "hw_control", since that is the name under which this setting is called in the code. Everywhere else in kernel when some work that is normally done in software can be made to be done by hardware instead, we use the word "offloading" to describe this, e.g. "LED blinking is offloaded to hardware". Normally renaming sysfs entries is a no-go because of backwards compatibility. But since this patch was not yet released in a stable kernel, I think it is still possible to rename it, if there is consensus. Fixes: b655892ffd6d ("leds: trigger: netdev: expose hw_control status via sysfs") Signed-off-by: Marek Behún <[email protected]> Reviewed-by: Andrew Lunn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-22net: ethernet: mtk_eth_soc: fix NULL pointer on hw resetDaniel Golle1-2/+10
When a hardware reset is triggered on devices not initializing WED the calls to mtk_wed_fe_reset and mtk_wed_fe_reset_complete dereference a pointer on uninitialized stack memory. Break out of both functions in case a hw_list entry is 0. Fixes: 08a764a7c51b ("net: ethernet: mtk_wed: add reset/reset_complete callbacks") Signed-off-by: Daniel Golle <[email protected]> Reviewed-by: Simon Horman <[email protected]> Acked-by: Lorenzo Bianconi <[email protected]> Link: https://lore.kernel.org/r/5465c1609b464cc7407ae1530c40821dcdf9d3e6.1692634266.git.daniel@makrotopia.org Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-22Merge tag 'nfs-for-6.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds5-23/+35
Pull NFS client fixes from Trond Myklebust: - fix a use after free in nfs_direct_join_group() (Cc: stable) - fix sysfs server name memory leak - fix lock recovery hang in NFSv4.0 - fix page free in the error path for nfs42_proc_getxattr() and __nfs4_get_acl_uncached() - SUNRPC/rdma: fix receive buffer dma-mapping after a server disconnect * tag 'nfs-for-6.5-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: xprtrdma: Remap Receive buffers after a reconnect NFSv4: fix out path in __nfs4_get_acl_uncached NFSv4.2: fix error handling in nfs42_proc_getxattr NFS: Fix sysfs server name memory leak NFS: Fix a use after free in nfs_direct_join_group() NFSv4: Fix dropped lock for racing OPEN and delegation return
2023-08-22Merge tag 'selinux-pr-20230821' of ↵Linus Torvalds1-1/+1
git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux Pull selinux fix from Paul Moore: "A small fix for a potential problem when cleaning up after a failed SELinux policy load (list next pointer not being properly initialized to NULL early enough)" * tag 'selinux-pr-20230821' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux: selinux: set next pointer before attaching to list
2023-08-22drm/i915/dgfx: Enable d3cold at s2idleAnshuman Gupta1-15/+18
System wide suspend already has support for lmem save/restore during suspend therefore enabling d3cold for s2idle and keepng it disable for runtime PM.(Refer below commit for d3cold runtime PM disable justification) 'commit 66eb93e71a7a ("drm/i915/dgfx: Keep PCI autosuspend control 'on' by default on all dGPU")' It will reduce the DG2 Card power consumption to ~0 Watt for s2idle power KPI. v2: - Added "Cc: [email protected]". Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/8755 Cc: [email protected] Cc: Rodrigo Vivi <[email protected]> Signed-off-by: Anshuman Gupta <[email protected]> Reviewed-by: Rodrigo Vivi <[email protected]> Tested-by: Aaron Ma <[email protected]> Tested-by: Jianshui Yu <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit 2643e6d1f2a5e51877be24042d53cf956589be10) Signed-off-by: Rodrigo Vivi <[email protected]>
2023-08-22affs: rename local toupper() to fn() to avoid confusionAndy Shevchenko1-10/+10
A compiler may see the collision with the toupper() defined in ctype.h: fs/affs/namei.c:159:19: warning: unused variable 'toupper' [-Wunused-variable] 159 | toupper_t toupper = affs_get_toupper(sb); To prevent this from happening, rename toupper local variable to fn. Initially this had been introduced by 24579a881513 ("v2.4.3.5 -> v2.4.3.6") in the history.git by history group. Reported-by: kernel test robot <[email protected]> Signed-off-by: Andy Shevchenko <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-08-22affs: remove writepage implementationMatthew Wilcox (Oracle)1-5/+9
If the filesystem implements migrate_folio and writepages, there is no need for a writepage implementation. Signed-off-by: Matthew Wilcox (Oracle) <[email protected]> Reviewed-by: David Sterba <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-08-22btrfs: zoned: skip splitting and logical rewriting on pre-alloc writeNaohiro Aota1-4/+15
When doing a relocation, there is a chance that at the time of btrfs_reloc_clone_csums(), there is no checksum for the corresponding region. In this case, btrfs_finish_ordered_zoned()'s sum points to an invalid item and so ordered_extent's logical is set to some invalid value. Then, btrfs_lookup_block_group() in btrfs_zone_finish_endio() failed to find a block group and will hit an assert or a null pointer dereference as following. This can be reprodcued by running btrfs/028 several times (e.g, 4 to 16 times) with a null_blk setup. The device's zone size and capacity is set to 32 MB and the storage size is set to 5 GB on my setup. KASAN: null-ptr-deref in range [0x0000000000000088-0x000000000000008f] CPU: 6 PID: 3105720 Comm: kworker/u16:13 Tainted: G W 6.5.0-rc6-kts+ #1 Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0 12/17/2015 Workqueue: btrfs-endio-write btrfs_work_helper [btrfs] RIP: 0010:btrfs_zone_finish_endio.part.0+0x34/0x160 [btrfs] Code: 41 54 49 89 fc 55 48 89 f5 53 e8 57 7d fc ff 48 8d b8 88 00 00 00 48 89 c3 48 b8 00 00 00 00 00 > 3c 02 00 0f 85 02 01 00 00 f6 83 88 00 00 00 01 0f 84 a8 00 00 RSP: 0018:ffff88833cf87b08 EFLAGS: 00010206 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000011 RSI: 0000000000000004 RDI: 0000000000000088 RBP: 0000000000000002 R08: 0000000000000001 R09: ffffed102877b827 R10: ffff888143bdc13b R11: ffff888125b1cbc0 R12: ffff888143bdc000 R13: 0000000000007000 R14: ffff888125b1cba8 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88881e500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f3ed85223d5 CR3: 00000001519b4005 CR4: 00000000001706e0 Call Trace: <TASK> ? die_addr+0x3c/0xa0 ? exc_general_protection+0x148/0x220 ? asm_exc_general_protection+0x22/0x30 ? btrfs_zone_finish_endio.part.0+0x34/0x160 [btrfs] ? btrfs_zone_finish_endio.part.0+0x19/0x160 [btrfs] btrfs_finish_one_ordered+0x7b8/0x1de0 [btrfs] ? rcu_is_watching+0x11/0xb0 ? lock_release+0x47a/0x620 ? btrfs_finish_ordered_zoned+0x59b/0x800 [btrfs] ? __pfx_btrfs_finish_one_ordered+0x10/0x10 [btrfs] ? btrfs_finish_ordered_zoned+0x358/0x800 [btrfs] ? __smp_call_single_queue+0x124/0x350 ? rcu_is_watching+0x11/0xb0 btrfs_work_helper+0x19f/0xc60 [btrfs] ? __pfx_try_to_wake_up+0x10/0x10 ? _raw_spin_unlock_irq+0x24/0x50 ? rcu_is_watching+0x11/0xb0 process_one_work+0x8c1/0x1430 ? __pfx_lock_acquire+0x10/0x10 ? __pfx_process_one_work+0x10/0x10 ? __pfx_do_raw_spin_lock+0x10/0x10 ? _raw_spin_lock_irq+0x52/0x60 worker_thread+0x100/0x12c0 ? __kthread_parkme+0xc1/0x1f0 ? __pfx_worker_thread+0x10/0x10 kthread+0x2ea/0x3c0 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x30/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1b/0x30 </TASK> On the zoned mode, writing to pre-allocated region means data relocation write. Such write always uses WRITE command so there is no need of splitting and rewriting logical address. Thus, we can just skip the function for the case. Fixes: cbfce4c7fbde ("btrfs: optimize the logical to physical mapping for zoned writes") Signed-off-by: Naohiro Aota <[email protected]> Signed-off-by: David Sterba <[email protected]>
2023-08-22super: use higher-level helper for {freeze,thaw}Christian Brauner1-3/+12
It's not necessary to use low-level locking helpers here. Use the higher-level locking helpers and log if the superblock is dying. Since the caller is assumed to already hold an active reference it isn't possible to observe a dying superblock. Suggested-by: Jan Kara <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-08-22Merge ACPI thermal driver changes for 6.6-rc1.Rafael J. Wysocki13-304/+385
This reworks the ACPI thermal driver to use a table of generic trip point structures on top of the internal representation of trip points and removes thermal zone callbacks that are not necessary any more from it. It requires some relatively small changes to be made in the thermal core too and it is based on top of changes reworking ACPI device notification handling that are included in this merge. * acpi-thermal: (24 commits) ACPI: thermal: Eliminate code duplication from acpi_thermal_notify() ACPI: thermal: Drop unnecessary thermal zone callbacks ACPI: thermal: Rework thermal_get_trend() ACPI: thermal: Use trip point table to register thermal zones thermal: core: Rework and rename __for_each_thermal_trip() ACPI: thermal: Introduce struct acpi_thermal_trip ACPI: thermal: Carry out trip point updates under zone lock ACPI: thermal: Clean up acpi_thermal_register_thermal_zone() thermal: core: Add priv pointer to struct thermal_trip thermal: core: Introduce thermal_zone_device_exec() thermal: core: Do not handle trip points with invalid temperature ACPI: thermal: Drop redundant local variable from acpi_thermal_resume() ACPI: thermal: Do not attach private data to ACPI handles ACPI: thermal: Drop enabled flag from struct acpi_thermal_active ACPI: thermal: Drop nocrt parameter ACPI: thermal: Install Notify() handler directly ACPI: NFIT: Remove unnecessary .remove callback ACPI: NFIT: Install Notify() handler directly ACPI: HED: Install Notify() handler directly ACPI: battery: Install Notify() handler directly ...
2023-08-22xen: privcmd: Add support for irqfdViresh Kumar3-2/+301
Xen provides support for injecting interrupts to the guests via the HYPERVISOR_dm_op() hypercall. The same is used by the Virtio based device backend implementations, in an inefficient manner currently. Generally, the Virtio backends are implemented to work with the Eventfd based mechanism. In order to make such backends work with Xen, another software layer needs to poll the Eventfds and raise an interrupt to the guest using the Xen based mechanism. This results in an extra context switch. This is not a new problem in Linux though. It is present with other hypervisors like KVM, etc. as well. The generic solution implemented in the kernel for them is to provide an IOCTL call to pass the interrupt details and eventfd, which lets the kernel take care of polling the eventfd and raising of the interrupt, instead of handling this in user space (which involves an extra context switch). This patch adds support to inject a specific interrupt to guest using the eventfd mechanism, by preventing the extra context switch. Inspired by existing implementations for KVM, etc.. Signed-off-by: Viresh Kumar <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Link: https://lore.kernel.org/r/8e724ac1f50c2bc1eb8da9b3ff6166f1372570aa.1692697321.git.viresh.kumar@linaro.org Signed-off-by: Juergen Gross <[email protected]>
2023-08-22tmpfs,xattr: GFP_KERNEL_ACCOUNT for simple xattrsHugh Dickins2-3/+3
It is particularly important for the userns mount case (when a sensible nr_inodes maximum may not be enforced) that tmpfs user xattrs be subject to memory cgroup limiting. Leave temporary buffer allocations as is, but change the persistent simple xattr allocations from GFP_KERNEL to GFP_KERNEL_ACCOUNT. This limits kernfs's cgroupfs too, but that's good. (I had intended to send this change earlier, but had been confused by shmem_alloc_inode() using GFP_KERNEL, and thought a discussion would be needed to change that too: no, I was forgetting the SLAB_ACCOUNT on that kmem_cache, which implicitly adds __GFP_ACCOUNT to all its allocations.) Signed-off-by: Hugh Dickins <[email protected]> Reviewed-by: Jan Kara <[email protected]> Message-Id: <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-08-22efi/runtime-wrappers: Clean up white space and add __init annotationArd Biesheuvel1-22/+19
Some cosmetic changes as well as a missing __init annotation. Signed-off-by: Ard Biesheuvel <[email protected]>
2023-08-22acpi/prmt: Use EFI runtime sandbox to invoke PRM handlersArd Biesheuvel4-4/+40
Instead of bypassing the kernel's adaptation layer for performing EFI runtime calls, wire up ACPI PRM handling into it. This means these calls can no longer occur concurrently with EFI runtime calls, and will be made from the EFI runtime workqueue. It also means any page faults occurring during PRM handling will be identified correctly as originating in firmware code. Acked-by: Rafael J. Wysocki <[email protected]> Signed-off-by: Ard Biesheuvel <[email protected]>
2023-08-22efi/runtime-wrappers: Don't duplicate setup/teardown codeArd Biesheuvel2-10/+22
Avoid duplicating the EFI arch setup and teardown routine calls numerous times in efi_call_rts(). Instead, expand the efi_call_virt_pointer() macro into efi_call_rts(), taking the pre and post parts out of the switch. Signed-off-by: Ard Biesheuvel <[email protected]>
2023-08-22efi/runtime-wrappers: Remove duplicated macro for service returning voidArd Biesheuvel4-24/+14
__efi_call_virt() exists as an alternative for efi_call_virt() for the sole reason that ResetSystem() returns void, and so we cannot use a call to it in the RHS of an assignment. Given that there is only a single user, let's drop the macro, and expand it into the caller. That way, the remaining macro can be tightened somewhat in terms of type safety too. Note that the use of typeof() on the runtime service invocation does not result in an actual call being made, but it does require a few pointer types to be fixed up and converted into the proper function pointer prototypes. Signed-off-by: Ard Biesheuvel <[email protected]>
2023-08-22drm/display/dp: Fix the DP DSC Receiver cap sizeAnkit Nautiyal1-1/+1
DP DSC Receiver Capabilities are exposed via DPCD 60h-6Fh. Fix the DSC RECEIVER CAP SIZE accordingly. Fixes: ffddc4363c28 ("drm/dp: Add DP DSC DPCD receiver capability size define and missing SHIFT") Cc: Anusha Srivatsa <[email protected]> Cc: Manasi Navare <[email protected]> Cc: <[email protected]> # v5.0+ Signed-off-by: Ankit Nautiyal <[email protected]> Reviewed-by: Stanislav Lisovskiy <[email protected]> Signed-off-by: Jani Nikula <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected]
2023-08-22xen/xenbus: Avoid a lockdep warning when adding a watchPetr Pavlu1-2/+2
The following lockdep warning appears during boot on a Xen dom0 system: [ 96.388794] ====================================================== [ 96.388797] WARNING: possible circular locking dependency detected [ 96.388799] 6.4.0-rc5-default+ #8 Tainted: G EL [ 96.388803] ------------------------------------------------------ [ 96.388804] xenconsoled/1330 is trying to acquire lock: [ 96.388808] ffffffff82acdd10 (xs_watch_rwsem){++++}-{3:3}, at: register_xenbus_watch+0x45/0x140 [ 96.388847] but task is already holding lock: [ 96.388849] ffff888100c92068 (&u->msgbuffer_mutex){+.+.}-{3:3}, at: xenbus_file_write+0x2c/0x600 [ 96.388862] which lock already depends on the new lock. [ 96.388864] the existing dependency chain (in reverse order) is: [ 96.388866] -> #2 (&u->msgbuffer_mutex){+.+.}-{3:3}: [ 96.388874] __mutex_lock+0x85/0xb30 [ 96.388885] xenbus_dev_queue_reply+0x48/0x2b0 [ 96.388890] xenbus_thread+0x1d7/0x950 [ 96.388897] kthread+0xe7/0x120 [ 96.388905] ret_from_fork+0x2c/0x50 [ 96.388914] -> #1 (xs_response_mutex){+.+.}-{3:3}: [ 96.388923] __mutex_lock+0x85/0xb30 [ 96.388930] xenbus_backend_ioctl+0x56/0x1c0 [ 96.388935] __x64_sys_ioctl+0x90/0xd0 [ 96.388942] do_syscall_64+0x5c/0x90 [ 96.388950] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 96.388957] -> #0 (xs_watch_rwsem){++++}-{3:3}: [ 96.388965] __lock_acquire+0x1538/0x2260 [ 96.388972] lock_acquire+0xc6/0x2b0 [ 96.388976] down_read+0x2d/0x160 [ 96.388983] register_xenbus_watch+0x45/0x140 [ 96.388990] xenbus_file_write+0x53d/0x600 [ 96.388994] vfs_write+0xe4/0x490 [ 96.389003] ksys_write+0xb8/0xf0 [ 96.389011] do_syscall_64+0x5c/0x90 [ 96.389017] entry_SYSCALL_64_after_hwframe+0x72/0xdc [ 96.389023] other info that might help us debug this: [ 96.389025] Chain exists of: xs_watch_rwsem --> xs_response_mutex --> &u->msgbuffer_mutex [ 96.413429] Possible unsafe locking scenario: [ 96.413430] CPU0 CPU1 [ 96.413430] ---- ---- [ 96.413431] lock(&u->msgbuffer_mutex); [ 96.413432] lock(xs_response_mutex); [ 96.413433] lock(&u->msgbuffer_mutex); [ 96.413434] rlock(xs_watch_rwsem); [ 96.413436] *** DEADLOCK *** [ 96.413436] 1 lock held by xenconsoled/1330: [ 96.413438] #0: ffff888100c92068 (&u->msgbuffer_mutex){+.+.}-{3:3}, at: xenbus_file_write+0x2c/0x600 [ 96.413446] An ioctl call IOCTL_XENBUS_BACKEND_SETUP (record #1 in the report) results in calling xenbus_alloc() -> xs_suspend() which introduces ordering xs_watch_rwsem --> xs_response_mutex. The xenbus_thread() operation (record #2) creates xs_response_mutex --> &u->msgbuffer_mutex. An XS_WATCH write to the xenbus file then results in a complain about the opposite lock order &u->msgbuffer_mutex --> xs_watch_rwsem. The dependency xs_watch_rwsem --> xs_response_mutex is spurious. Avoid it and the warning by changing the ordering in xs_suspend(), first acquire xs_response_mutex and then xs_watch_rwsem. Reverse also the unlocking order in xs_suspend_cancel() for consistency, but keep xs_resume() as is because it needs to have xs_watch_rwsem unlocked only after exiting xs suspend and re-adding all watches. Signed-off-by: Petr Pavlu <[email protected]> Reviewed-by: Juergen Gross <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Juergen Gross <[email protected]>
2023-08-21tg3: Use slab_build_skb() when neededKees Cook1-1/+4
The tg3 driver will use kmalloc() under some conditions. Check the frag_size and use slab_build_skb() when frag_size is 0. Silences the warning introduced by commit ce098da1497c ("skbuff: Introduce slab_build_skb()"): Use slab_build_skb() instead ... tg3_poll_work+0x638/0xf90 [tg3] Fixes: ce098da1497c ("skbuff: Introduce slab_build_skb()") Reported-by: Fiona Ebner <[email protected]> Closes: https://lore.kernel.org/all/[email protected] Cc: Siva Reddy Kallam <[email protected]> Cc: Prashant Sreedharan <[email protected]> Cc: Michael Chan <[email protected]> Cc: Bagas Sanjaya <[email protected]> Signed-off-by: Kees Cook <[email protected]> Reviewed-by: Pavan Chebbi <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-21selftests: bonding: do not set port down before adding to bondHangbin Liu1-2/+2
Before adding a port to bond, it need to be set down first. In the lacpdu test the author set the port down specifically. But commit a4abfa627c38 ("net: rtnetlink: Enslave device before bringing it up") changed the operation order, the kernel will set the port down _after_ adding to bond. So all the ports will be down at last and the test failed. In fact, the veth interfaces are already inactive when added. This means there's no need to set them down again before adding to the bond. Let's just remove the link down operation. Fixes: a4abfa627c38 ("net: rtnetlink: Enslave device before bringing it up") Reported-by: Zhengchao Shao <[email protected]> Closes: https://lore.kernel.org/netdev/[email protected]/ Signed-off-by: Hangbin Liu <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-08-21samples: ftrace: Replace bti assembly with hint for older compilerGONG, Ruiqi5-7/+7
When cross-building the arm64 kernel with allmodconfig using GCC 9.4, the following error occurs on multiple files under samples/ftrace/: /tmp/ccPC1ODs.s: Assembler messages: /tmp/ccPC1ODs.s:8: Error: selected processor does not support `bti c' Fix this issue by replacing `bti c` with `hint 34`, which is compatible for the older compiler. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Florent Revest <[email protected]> Fixes: 8c3526fb86060cb5 ("arm64: ftrace: Add direct call trampoline samples support") Acked-by: Mark Rutland <[email protected]> Signed-off-by: GONG, Ruiqi <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-08-21scsi: ufs: ufs-qcom: Clear qunipro_g4_sel for HW major version > 5Neil Armstrong1-1/+1
The qunipro_g4_sel clear is also needed for new platforms with major version > 5. Fix the version check to take this into account. Fixes: 9c02aa24bf40 ("scsi: ufs: ufs-qcom: Clear qunipro_g4_sel for HW version major 5") Acked-by: Manivannan Sadhasivam <[email protected]> Reviewed-by: Nitin Rawat <[email protected]> Signed-off-by: Neil Armstrong <[email protected]> Link: https://lore.kernel.org/r/20230821-topic-sm8x50-upstream-ufs-major-5-plus-v2-1-f42a4b712e58@linaro.org Reviewed-by: "Bao D. Nguyen" <[email protected]> Signed-off-by: Martin K. Petersen <[email protected]>
2023-08-21scsi: ufs: mcq: Fix the search/wrap around logicBao D. Nguyen1-2/+4
The search and wrap around logic in the ufshcd_mcq_sqe_search() function does not work correctly when the hwq's queue depth is not a power of two number. Correct it so that any queue depth with a positive integer value within the supported range would work. Signed-off-by: "Bao D. Nguyen" <[email protected]> Link: https://lore.kernel.org/r/ff49c15be205135ed3ec186f3086694c02867dbd.1692149603.git.quic_nguyenb@quicinc.com Reviewed-by: Bart Van Assche <[email protected]> Fixes: 8d7290348992 ("scsi: ufs: mcq: Add supporting functions for MCQ abort") Signed-off-by: Martin K. Petersen <[email protected]>
2023-08-21of/platform: increase refcount of fwnodePeng Fan1-2/+2
commit 0f8e5651095b ("of/platform: Propagate firmware node by calling device_set_node()") use of_fwnode_handle to replace of_node_get, which introduces a side effect that the refcount is not increased. Then the out of tree jailhouse hypervisor enable/disable test will trigger kernel dump in of_overlay_remove, with the following sequence " of_changeset_revert(&overlay_changeset); of_changeset_destroy(&overlay_changeset); of_overlay_remove(&overlay_id); " So increase the refcount to avoid issues. This patch also release the refcount when releasing amba device to avoid refcount leakage. Fixes: 0f8e5651095b ("of/platform: Propagate firmware node by calling device_set_node()") Reviewed-by: Andy Shevchenko <[email protected]> Signed-off-by: Peng Fan <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Rob Herring <[email protected]>
2023-08-21mm: multi-gen LRU: don't spin during memcg releaseT.J. Mercier1-5/+8
When a memcg is in the process of being released mem_cgroup_tryget will fail because its reference count has already reached 0. This can happen during reclaim if the memcg has already been offlined, and we reclaim all remaining pages attributed to the offlined memcg. shrink_many attempts to skip the empty memcg in this case, and continue reclaiming from the remaining memcgs in the old generation. If there is only one memcg remaining, or if all remaining memcgs are in the process of being released then shrink_many will spin until all memcgs have finished being released. The release occurs through a workqueue, so it can take a while before kswapd is able to make any further progress. This fix results in reductions in kswapd activity and direct reclaim in a test where 28 apps (working set size > total memory) are repeatedly launched in a random sequence: A B delta ratio(%) allocstall_movable 5962 3539 -2423 -40.64 allocstall_normal 2661 2417 -244 -9.17 kswapd_high_wmark_hit_quickly 53152 7594 -45558 -85.71 pageoutrun 57365 11750 -45615 -79.52 Link: https://lkml.kernel.org/r/[email protected] Fixes: e4dde56cd208 ("mm: multi-gen LRU: per-node lru_gen_folio lists") Signed-off-by: T.J. Mercier <[email protected]> Acked-by: Yu Zhao <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21mm: memory-failure: fix unexpected return value in soft_offline_page()Miaohe Lin1-4/+7
When page_handle_poison() fails to handle the hugepage or free page in retry path, soft_offline_page() will return 0 while -EBUSY is expected in this case. Consequently the user will think soft_offline_page succeeds while it in fact failed. So the user will not try again later in this case. Link: https://lkml.kernel.org/r/[email protected] Fixes: b94e02822deb ("mm,hwpoison: try to narrow window race for free pages") Signed-off-by: Miaohe Lin <[email protected]> Acked-by: Naoya Horiguchi <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21radix tree: remove unused variableArnd Bergmann1-1/+0
Recent versions of clang warn about an unused variable, though older versions saw the 'slot++' as a use and did not warn: radix-tree.c:1136:50: error: parameter 'slot' set but not used [-Werror,-Wunused-but-set-parameter] It's clearly not needed any more, so just remove it. Link: https://lkml.kernel.org/r/[email protected] Fixes: 3a08cd52c37c7 ("radix tree: Remove multiorder support") Signed-off-by: Arnd Bergmann <[email protected]> Cc: Matthew Wilcox <[email protected]> Cc: Nathan Chancellor <[email protected]> Cc: Nick Desaulniers <[email protected]> Cc: Peng Zhang <[email protected]> Cc: Rong Tao <[email protected]> Cc: Tom Rix <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21mm: add a call to flush_cache_vmap() in vmap_pfn()Alexandre Ghiti1-0/+4
flush_cache_vmap() must be called after new vmalloc mappings are installed in the page table in order to allow architectures to make sure the new mapping is visible. It could lead to a panic since on some architectures (like powerpc), the page table walker could see the wrong pte value and trigger a spurious page fault that can not be resolved (see commit f1cb8f9beba8 ("powerpc/64s/radix: avoid ptesync after set_pte and ptep_set_access_flags")). But actually the patch is aiming at riscv: the riscv specification allows the caching of invalid entries in the TLB, and since we recently removed the vmalloc page fault handling, we now need to emit a tlb shootdown whenever a new vmalloc mapping is emitted (https://lore.kernel.org/linux-riscv/[email protected]/). That's a temporary solution, there are ways to avoid that :) Link: https://lkml.kernel.org/r/[email protected] Fixes: 3e9a9e256b1e ("mm: add a vmap_pfn function") Reported-by: Dylan Jhong <[email protected]> Closes: https://lore.kernel.org/linux-riscv/[email protected]/ Signed-off-by: Alexandre Ghiti <[email protected]> Reviewed-by: Christoph Hellwig <[email protected]> Reviewed-by: Palmer Dabbelt <[email protected]> Acked-by: Palmer Dabbelt <[email protected]> Reviewed-by: Dylan Jhong <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21selftests/mm: FOLL_LONGTERM need to be updated to 0x100Ayush Jain1-1/+6
After commit 2c2241081f7d ("mm/gup: move private gup FOLL_ flags to internal.h") FOLL_LONGTERM flag value got updated from 0x10000 to 0x100 at include/linux/mm_types.h. As hmm.hmm_device_private.hmm_gup_test uses FOLL_LONGTERM Updating same here as well. Before this change test goes in an infinite assert loop in hmm.hmm_device_private.hmm_gup_test ========================================================== RUN hmm.hmm_device_private.hmm_gup_test ... hmm-tests.c:1962:hmm_gup_test:Expected HMM_DMIRROR_PROT_WRITE.. ..(2) == m[2] (34) hmm-tests.c:157:hmm_gup_test:Expected ret (-1) == 0 (0) hmm-tests.c:157:hmm_gup_test:Expected ret (-1) == 0 (0) ... ========================================================== Call Trace: <TASK> ? sched_clock+0xd/0x20 ? __lock_acquire.constprop.0+0x120/0x6c0 ? ktime_get+0x2c/0xd0 ? sched_clock+0xd/0x20 ? local_clock+0x12/0xd0 ? lock_release+0x26e/0x3b0 pin_user_pages_fast+0x4c/0x70 gup_test_ioctl+0x4ff/0xbb0 ? gup_test_ioctl+0x68c/0xbb0 __x64_sys_ioctl+0x99/0xd0 do_syscall_64+0x60/0x90 ? syscall_exit_to_user_mode+0x2a/0x50 ? do_syscall_64+0x6d/0x90 ? syscall_exit_to_user_mode+0x2a/0x50 ? do_syscall_64+0x6d/0x90 ? irqentry_exit_to_user_mode+0xd/0x20 ? irqentry_exit+0x3f/0x50 ? exc_page_fault+0x96/0x200 entry_SYSCALL_64_after_hwframe+0x72/0xdc RIP: 0033:0x7f6aaa31aaff After this change test is able to pass successfully. Link: https://lkml.kernel.org/r/[email protected] Fixes: 2c2241081f7d ("mm/gup: move private gup FOLL_ flags to internal.h") Signed-off-by: Ayush Jain <[email protected]> Reviewed-by: Raghavendra K T <[email protected]> Reviewed-by: John Hubbard <[email protected]> Acked-by: David Hildenbrand <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21nilfs2: fix general protection fault in nilfs_lookup_dirty_data_buffers()Ryusuke Konishi1-0/+5
A syzbot stress test reported that create_empty_buffers() called from nilfs_lookup_dirty_data_buffers() can cause a general protection fault. Analysis using its reproducer revealed that the back reference "mapping" from a page/folio has been changed to NULL after dirty page/folio gang lookup in nilfs_lookup_dirty_data_buffers(). Fix this issue by excluding pages/folios from being collected if, after acquiring a lock on each page/folio, its back reference "mapping" differs from the pointer to the address space struct that held the page/folio. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Ryusuke Konishi <[email protected]> Reported-by: [email protected] Closes: https://lkml.kernel.org/r/[email protected] Tested-by: Ryusuke Konishi <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21mm/gup: handle cont-PTE hugetlb pages correctly in gup_must_unshare() via ↵David Hildenbrand1-0/+10
GUP-fast In contrast to most other GUP code, GUP-fast common page table walking code like gup_pte_range() also handles hugetlb pages. But in contrast to other hugetlb page table walking code, it does not look at the hugetlb PTE abstraction whereby we have only a single logical hugetlb PTE per hugetlb page, even when using multiple cont-PTEs underneath -- which is for example what huge_ptep_get() abstracts. So when we have a hugetlb page that is mapped via cont-PTEs, GUP-fast might stumble over a PTE that does not map the head page of a hugetlb page -- not the first "head" PTE of such a cont mapping. Logically, the whole hugetlb page is mapped (entire_mapcount == 1), but we might end up calling gup_must_unshare() with a tail page of a hugetlb page. We only maintain a single PageAnonExclusive flag per hugetlb page (as hugetlb pages cannot get partially COW-shared), stored for the head page. That flag is clear for all tail pages. So when gup_must_unshare() ends up calling PageAnonExclusive() with a tail page of a hugetlb page: 1) With CONFIG_DEBUG_VM_PGFLAGS Stumbles over the: VM_BUG_ON_PGFLAGS(PageHuge(page) && !PageHead(page), page); For example, when executing the COW selftests with 64k hugetlb pages on arm64: [ 61.082187] page:00000000829819ff refcount:3 mapcount:1 mapping:0000000000000000 index:0x1 pfn:0x11ee11 [ 61.082842] head:0000000080f79bf7 order:4 entire_mapcount:1 nr_pages_mapped:0 pincount:2 [ 61.083384] anon flags: 0x17ffff80003000e(referenced|uptodate|dirty|head|mappedtodisk|node=0|zone=2|lastcpupid=0xfffff) [ 61.084101] page_type: 0xffffffff() [ 61.084332] raw: 017ffff800000000 fffffc00037b8401 0000000000000402 0000000200000000 [ 61.084840] raw: 0000000000000010 0000000000000000 00000000ffffffff 0000000000000000 [ 61.085359] head: 017ffff80003000e ffffd9e95b09b788 ffffd9e95b09b788 ffff0007ff63cf71 [ 61.085885] head: 0000000000000000 0000000000000002 00000003ffffffff 0000000000000000 [ 61.086415] page dumped because: VM_BUG_ON_PAGE(PageHuge(page) && !PageHead(page)) [ 61.086914] ------------[ cut here ]------------ [ 61.087220] kernel BUG at include/linux/page-flags.h:990! [ 61.087591] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP [ 61.087999] Modules linked in: ... [ 61.089404] CPU: 0 PID: 4612 Comm: cow Kdump: loaded Not tainted 6.5.0-rc4+ #3 [ 61.089917] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [ 61.090409] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 61.090897] pc : gup_must_unshare.part.0+0x64/0x98 [ 61.091242] lr : gup_must_unshare.part.0+0x64/0x98 [ 61.091592] sp : ffff8000825eb940 [ 61.091826] x29: ffff8000825eb940 x28: 0000000000000000 x27: fffffc00037b8440 [ 61.092329] x26: 0400000000000001 x25: 0000000000080101 x24: 0000000000080000 [ 61.092835] x23: 0000000000080100 x22: ffff0000cffb9588 x21: ffff0000c8ec6b58 [ 61.093341] x20: 0000ffffad6b1000 x19: fffffc00037b8440 x18: ffffffffffffffff [ 61.093850] x17: 2864616548656761 x16: 5021202626202965 x15: 6761702865677548 [ 61.094358] x14: 6567615028454741 x13: 2929656761702864 x12: 6165486567615021 [ 61.094858] x11: 00000000ffff7fff x10: 00000000ffff7fff x9 : ffffd9e958b7a1c0 [ 61.095359] x8 : 00000000000bffe8 x7 : c0000000ffff7fff x6 : 00000000002bffa8 [ 61.095873] x5 : ffff0008bb19e708 x4 : 0000000000000000 x3 : 0000000000000000 [ 61.096380] x2 : 0000000000000000 x1 : ffff0000cf6636c0 x0 : 0000000000000046 [ 61.096894] Call trace: [ 61.097080] gup_must_unshare.part.0+0x64/0x98 [ 61.097392] gup_pte_range+0x3a8/0x3f0 [ 61.097662] gup_pgd_range+0x1ec/0x280 [ 61.097942] lockless_pages_from_mm+0x64/0x1a0 [ 61.098258] internal_get_user_pages_fast+0xe4/0x1d0 [ 61.098612] pin_user_pages_fast+0x58/0x78 [ 61.098917] pin_longterm_test_start+0xf4/0x2b8 [ 61.099243] gup_test_ioctl+0x170/0x3b0 [ 61.099528] __arm64_sys_ioctl+0xa8/0xf0 [ 61.099822] invoke_syscall.constprop.0+0x7c/0xd0 [ 61.100160] el0_svc_common.constprop.0+0xe8/0x100 [ 61.100500] do_el0_svc+0x38/0xa0 [ 61.100736] el0_svc+0x3c/0x198 [ 61.100971] el0t_64_sync_handler+0x134/0x150 [ 61.101280] el0t_64_sync+0x17c/0x180 [ 61.101543] Code: aa1303e0 f00074c1 912b0021 97fffeb2 (d4210000) 2) Without CONFIG_DEBUG_VM_PGFLAGS Always detects "not exclusive" for passed tail pages and refuses to PIN the tail pages R/O, as gup_must_unshare() == true. GUP-fast will fallback to ordinary GUP. As ordinary GUP properly considers the logical hugetlb PTE abstraction in hugetlb_follow_page_mask(), pinning the page will succeed when looking at the PageAnonExclusive on the head page only. So the only real effect of this is that with cont-PTE hugetlb pages, we'll always fallback from GUP-fast to ordinary GUP when not working on the head page, which ends up checking the head page and do the right thing. Consequently, the cow selftests pass with cont-PTE hugetlb pages as well without CONFIG_DEBUG_VM_PGFLAGS. Note that this only applies to anon hugetlb pages that are mapped using cont-PTEs: for example 64k hugetlb pages on a 4k arm64 kernel. ... and only when R/O-pinning (FOLL_PIN) such pages that are mapped into the page table R/O using GUP-fast. On production kernels (and even most debug kernels, that don't set CONFIG_DEBUG_VM_PGFLAGS) this patch should theoretically not be required to be backported. But of course, it does not hurt. Link: https://lkml.kernel.org/r/[email protected] Fixes: a7f226604170 ("mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page") Signed-off-by: David Hildenbrand <[email protected]> Reported-by: Ryan Roberts <[email protected]> Reviewed-by: Ryan Roberts <[email protected]> Tested-by: Ryan Roberts <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: John Hubbard <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: Peter Xu <[email protected]> Cc: Mike Kravetz <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21selftests: cgroup: fix test_kmem_basic less than errorLucas Karpinski1-2/+2
test_kmem_basic creates 100,000 negative dentries, with each one mapping to a slab object. After memory.high is set, these are reclaimed through the shrink_slab function call which reclaims all 100,000 entries. The test passes the majority of the time because when slab1 or current is calculated, it is often above 0, however, 0 is also an acceptable value. Link: https://lkml.kernel.org/r/7d6gcuyzdjcice6qbphrmpmv5skr5jtglg375unnjxqhstvhxc@qkn6dw6bao6v Signed-off-by: Lucas Karpinski <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Muchun Song <[email protected]> Cc: Roman Gushchin <[email protected]> Cc: Shakeel Butt <[email protected]> Cc: Shuah Khan <[email protected]> Cc: Tejun Heo <[email protected]> Cc: Zefan Li <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21mm: enable page walking API to lock vmas during the walkSuren Baghdasaryan18-20/+100
walk_page_range() and friends often operate under write-locked mmap_lock. With introduction of vma locks, the vmas have to be locked as well during such walks to prevent concurrent page faults in these areas. Add an additional member to mm_walk_ops to indicate locking requirements for the walk. The change ensures that page walks which prevent concurrent page faults by write-locking mmap_lock, operate correctly after introduction of per-vma locks. With per-vma locks page faults can be handled under vma lock without taking mmap_lock at all, so write locking mmap_lock would not stop them. The change ensures vmas are properly locked during such walks. A sample issue this solves is do_mbind() performing queue_pages_range() to queue pages for migration. Without this change a concurrent page can be faulted into the area and be left out of migration. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Suren Baghdasaryan <[email protected]> Suggested-by: Linus Torvalds <[email protected]> Suggested-by: Jann Horn <[email protected]> Cc: David Hildenbrand <[email protected]> Cc: Davidlohr Bueso <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Johannes Weiner <[email protected]> Cc: Laurent Dufour <[email protected]> Cc: Liam Howlett <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Michal Hocko <[email protected]> Cc: Michel Lespinasse <[email protected]> Cc: Peter Xu <[email protected]> Cc: Vlastimil Babka <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21smaps: use vm_normal_page_pmd() instead of follow_trans_huge_pmd()David Hildenbrand3-5/+8
We shouldn't be using a GUP-internal helper if it can be avoided. Similar to smaps_pte_entry() that uses vm_normal_page(), let's use vm_normal_page_pmd() that similarly refuses to return the huge zeropage. In contrast to follow_trans_huge_pmd(), vm_normal_page_pmd(): (1) Will always return the head page, not a tail page of a THP. If we'd ever call smaps_account with a tail page while setting "compound = true", we could be in trouble, because smaps_account() would look at the memmap of unrelated pages. If we're unlucky, that memmap does not exist at all. Before we removed PG_doublemap, we could have triggered something similar as in commit 24d7275ce279 ("fs/proc: task_mmu.c: don't read mapcount for migration entry"). This can theoretically happen ever since commit ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock"): (a) We're in show_smaps_rollup() and processed a VMA (b) We release the mmap lock in show_smaps_rollup() because it is contended (c) We merged that VMA with another VMA (d) We collapsed a THP in that merged VMA at that position If the end address of the original VMA falls into the middle of a THP area, we would call smap_gather_stats() with a start address that falls into a PMD-mapped THP. It's probably very rare to trigger when not really forced. (2) Will succeed on a is_pci_p2pdma_page(), like vm_normal_page() Treat such PMDs here just like smaps_pte_entry() would treat such PTEs. If such pages would be anonymous, we most certainly would want to account them. (3) Will skip over pmd_devmap(), like vm_normal_page() for pte_devmap() As noted in vm_normal_page(), that is only for handling legacy ZONE_DEVICE pages. So just like smaps_pte_entry(), we'll now also ignore such PMD entries. Especially, follow_pmd_mask() never ends up calling follow_trans_huge_pmd() on pmd_devmap(). Instead it calls follow_devmap_pmd() -- which will fail if neither FOLL_GET nor FOLL_PIN is set. So skipping pmd_devmap() pages seems to be the right thing to do. (4) Will properly handle VM_MIXEDMAP/VM_PFNMAP, like vm_normal_page() We won't be returning a memmap that should be ignored by core-mm, or worse, a memmap that does not even exist. Note that while walk_page_range() will skip VM_PFNMAP mappings, walk_page_vma() won't. Most probably this case doesn't currently really happen on the PMD level, otherwise we'd already be able to trigger kernel crashes when reading smaps / smaps_rollup. So most probably only (1) is relevant in practice as of now, but could only cause trouble in extreme corner cases. Let's move follow_trans_huge_pmd() to mm/internal.h to discourage future reuse in wrong context. Link: https://lkml.kernel.org/r/[email protected] Fixes: ff9f47f6f00c ("mm: proc: smaps_rollup: do not stall write attempts on mmap_lock") Signed-off-by: David Hildenbrand <[email protected]> Acked-by: Mel Gorman <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: John Hubbard <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: liubo <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Peter Xu <[email protected]> Cc: Shuah Khan <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21mm/gup: reintroduce FOLL_NUMA as FOLL_HONOR_NUMA_FAULTDavid Hildenbrand4-14/+49
Unfortunately commit 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") missed that follow_page() and follow_trans_huge_pmd() never implicitly set FOLL_NUMA because they really don't want to fail on PROT_NONE-mapped pages -- either due to NUMA hinting or due to inaccessible (PROT_NONE) VMAs. As spelled out in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting page faults from gup/gup_fast"): "Other follow_page callers like KSM should not use FOLL_NUMA, or they would fail to get the pages if they use follow_page instead of get_user_pages." liubo reported [1] that smaps_rollup results are imprecise, because they miss accounting of pages that are mapped PROT_NONE. Further, it's easy to reproduce that KSM no longer works on inaccessible VMAs on x86-64, because pte_protnone()/pmd_protnone() also indictaes "true" in inaccessible VMAs, and follow_page() refuses to return such pages right now. As KVM really depends on these NUMA hinting faults, removing the pte_protnone()/pmd_protnone() handling in GUP code completely is not really an option. To fix the issues at hand, let's revive FOLL_NUMA as FOLL_HONOR_NUMA_FAULT to restore the original behavior for now and add better comments. Set FOLL_HONOR_NUMA_FAULT independent of FOLL_FORCE in is_valid_gup_args(), to add that flag for all external GUP users. Note that there are three GUP-internal __get_user_pages() users that don't end up calling is_valid_gup_args() and consequently won't get FOLL_HONOR_NUMA_FAULT set. 1) get_dump_page(): we really don't want to handle NUMA hinting faults. It specifies FOLL_FORCE and wouldn't have honored NUMA hinting faults already. 2) populate_vma_page_range(): we really don't want to handle NUMA hinting faults. It specifies FOLL_FORCE on accessible VMAs, so it wouldn't have honored NUMA hinting faults already. 3) faultin_vma_page_range(): we similarly don't want to handle NUMA hinting faults. To make the combination of FOLL_FORCE and FOLL_HONOR_NUMA_FAULT work in inaccessible VMAs properly, we have to perform VMA accessibility checks in gup_can_follow_protnone(). As GUP-fast should reject such pages either way in pte_access_permitted()/pmd_access_permitted() -- for example on x86-64 and arm64 that both implement pte_protnone() -- let's just always fallback to ordinary GUP when stumbling over pte_protnone()/pmd_protnone(). As Linus notes [2], honoring NUMA faults might only make sense for selected GUP users. So we should really see if we can instead let relevant GUP callers specify it manually, and not trigger NUMA hinting faults from GUP as default. Prepare for that by making FOLL_HONOR_NUMA_FAULT an external GUP flag and adding appropriate documenation. While at it, remove a stale comment from follow_trans_huge_pmd(): That comment for pmd_protnone() was added in commit 2b4847e73004 ("mm: numa: serialise parallel get_user_page against THP migration"), which noted: THP does not unmap pages due to a lack of support for migration entries at a PMD level. This allows races with get_user_pages Nowadays, we do have PMD migration entries, so the comment no longer applies. Let's drop it. [1] https://lore.kernel.org/r/[email protected] [2] https://lore.kernel.org/r/CAHk-=wgRiP_9X0rRdZKT8nhemZGNateMtb366t37d8-x7VRs=g@mail.gmail.com Link: https://lkml.kernel.org/r/[email protected] Fixes: 474098edac26 ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") Signed-off-by: David Hildenbrand <[email protected]> Reported-by: liubo <[email protected]> Closes: https://lore.kernel.org/r/[email protected] Reported-by: Peter Xu <[email protected]> Closes: https://lore.kernel.org/all/ZMKJjDaqZ7FW0jfe@x1n/ Acked-by: Mel Gorman <[email protected]> Acked-by: Peter Xu <[email protected]> Cc: Hugh Dickins <[email protected]> Cc: Jason Gunthorpe <[email protected]> Cc: John Hubbard <[email protected]> Cc: Linus Torvalds <[email protected]> Cc: Matthew Wilcox (Oracle) <[email protected]> Cc: Mel Gorman <[email protected]> Cc: Paolo Bonzini <[email protected]> Cc: Shuah Khan <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-08-21cpufreq: intel_pstate: set stale CPU frequency to minimumDoug Smythies1-0/+5
The intel_pstate CPU frequency scaling driver does not use policy->cur and it is 0. When the CPU frequency is outdated arch_freq_get_on_cpu() will default to the nominal clock frequency when its call to cpufreq_quick_getpolicy_cur returns the never updated 0. Thus, the listed frequency might be outside of currently set limits. Some users are complaining about the high reported frequency, albeit stale, when their system is idle and/or it is above the reduced maximum they have set. This patch will maintain policy_cur for the intel_pstate driver at the current minimum CPU frequency. Reported-by: Yang Jie <[email protected]> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217597 Signed-off-by: Doug Smythies <[email protected]> [ rjw: White space damage fixes and comment adjustment ] Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-08-21cpufreq: stats: Improve the performance of cpufreq_stats_create_table()Liao Chang1-1/+2
In the worst case, the freq_table of policy data is not sorted and contains duplicate frequencies, this means that it needs to iterate through the entire freq_table of policy to ensure each frequency is unique in the freq_table of stats data, this has a time complexity of O(N^2), where N is the number of frequencies in the freq_table of policy. However, if the policy.freq_table is already sorted and contains no duplicate frequencies, it can reduce the time complexity of creating stats.freq_table to O(N), the 'freq_table_sorted' field of policy data can be used to indicate whether the policy.freq_table is sorted. Signed-off-by: Liao Chang <[email protected]> Acked-by: Viresh Kumar <[email protected]> Reviewed-by: Dhruva Gole <[email protected]> [ rjw: Fix typo in changelog, remove redundant parens ] Signed-off-by: Rafael J. Wysocki <[email protected]>
2023-08-21nsproxy: Convert nsproxy.count to refcount_tElena Reshetova2-6/+5
atomic_t variables are currently used to implement reference counters with the following properties: - counter is initialized to 1 using atomic_set() - a resource is freed upon counter reaching zero - once counter reaches zero, its further increments aren't allowed - counter schema uses basic atomic operations (set, inc, inc_not_zero, dec_and_test, etc.) Such atomic variables should be converted to a newly provided refcount_t type and API that prevents accidental counter overflows and underflows. This is important since overflows and underflows can lead to use-after-free situation and be exploitable. The variable nsproxy.count is used as pure reference counter. Convert it to refcount_t and fix up the operations. **Important note for maintainers: Some functions from refcount_t API defined in refcount.h have different memory ordering guarantees than their atomic counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst for more information. Normally the differences should not matter since refcount_t provides enough guarantees to satisfy the refcounting use cases, but in some rare cases it might matter. Please double check that you don't have some undocumented memory guarantees for this variable usage. For the nsproxy.count it might make a difference in following places: - put_nsproxy() and switch_task_namespaces(): decrement in refcount_dec_and_test() only provides RELEASE ordering and ACQUIRE ordering on success vs. fully ordered atomic counterpart Suggested-by: Kees Cook <[email protected]> Signed-off-by: Elena Reshetova <[email protected]> Reviewed-by: David Windsor <[email protected]> Reviewed-by: Hans Liljestrand <[email protected]> Reviewed-by: Christian Brauner <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Kees Cook <[email protected]>