path: root/include
2024-09-04  Merge tag 'wireless-next-2024-09-04' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next  (Jakub Kicinski, 4 files, -16/+49)

Kalle Valo says:

====================
pull-request: wireless-next-2024-09-04

Here's a pull request to the net-next tree, more info below. Please let me know if there are any problems.
====================

Conflicts:
  drivers/net/wireless/ath/ath12k/hw.c
    38055789d151 ("wifi: ath12k: use 128 bytes aligned iova in transmit path for WCN7850")
    8be12629b428 ("wifi: ath12k: restore ASPM for supported hardwares only")
  https://lore.kernel.org/[email protected]

Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
2024-09-04  ipv4: Fix user space build failure due to header change  (Ido Schimmel, 2 files, -2/+1)

RT_TOS() from include/uapi/linux/in_route.h is defined using IPTOS_TOS_MASK from include/uapi/linux/ip.h. This is problematic for files such as include/net/ip_fib.h that want to use RT_TOS(), as without including both header files kernel compilation fails:

In file included from ./include/net/ip_fib.h:25,
                 from ./include/net/route.h:27,
                 from ./include/net/lwtunnel.h:9,
                 from net/core/dst.c:24:
./include/net/ip_fib.h: In function 'fib_dscp_masked_match':
./include/uapi/linux/in_route.h:31:32: error: 'IPTOS_TOS_MASK' undeclared (first use in this function)
   31 | #define RT_TOS(tos) ((tos)&IPTOS_TOS_MASK)
./include/net/ip_fib.h:440:45: note: in expansion of macro 'RT_TOS'
  440 |  return dscp == inet_dsfield_to_dscp(RT_TOS(fl4->flowi4_tos));

Therefore, the cited commit changed linux/in_route.h to include linux/ip.h. However, as reported by David, this breaks iproute2 compilation due to overlapping definitions between linux/ip.h and /usr/include/netinet/ip.h:

In file included from ../include/uapi/linux/in_route.h:5,
                 from iproute.c:19:
../include/uapi/linux/ip.h:25:9: warning: "IPTOS_TOS" redefined
   25 | #define IPTOS_TOS(tos) ((tos)&IPTOS_TOS_MASK)
In file included from iproute.c:17:
/usr/include/netinet/ip.h:222:9: note: this is the location of the previous definition
  222 | #define IPTOS_TOS(tos) ((tos) & IPTOS_TOS_MASK)

Fix by changing include/net/ip_fib.h to include linux/ip.h instead. Note that usage of RT_TOS() should not spread further in the kernel due to recent work in this area.

Fixes: 1fa3314c14c6 ("ipv4: Centralize TOS matching")
Reported-by: David Ahern <[email protected]>
Closes: https://lore.kernel.org/netdev/[email protected]/
Signed-off-by: Ido Schimmel <[email protected]>
Reviewed-by: David Ahern <[email protected]>
Reviewed-by: Guillaume Nault <[email protected]>
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
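The shape of the fix, as a minimal sketch (the comment placement is illustrative):

/* include/net/ip_fib.h -- sketch of the fix */
#include <linux/ip.h>        /* IPTOS_TOS_MASK, needed when RT_TOS() expands */
#include <linux/in_route.h>  /* RT_TOS() */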
2024-09-04  net: mana: Improve mana_set_channels() in low mem conditions  (Shradha Gupta, 1 file, -1/+1)

The mana_set_channels() function requires detaching the mana driver and reattaching it with the changed channel values. During this operation, if the system is low on memory, the reattach might fail, leaving the network device down. To avoid this, pre-allocate buffers at the beginning of the set operation to prevent complete network loss.

Signed-off-by: Shradha Gupta <[email protected]>
Reviewed-by: Gerhard Engleder <[email protected]>
Reviewed-by: Haiyang Zhang <[email protected]>
Link: https://patch.msgid.link/1725248734-21760-1-git-send-email-shradhagupta@linux.microsoft.com
Signed-off-by: Jakub Kicinski <[email protected]>
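As a sketch, the allocate-before-detach pattern could look like the following; the sketch_* helpers are illustrative stand-ins, not actual mana symbols:

static int sketch_set_channels(struct net_device *ndev, unsigned int count)
{
	void *bufs;
	int err;

	/* Reserve everything the new configuration needs while the old
	 * queues are still live; -ENOMEM here leaves the device running. */
	bufs = sketch_prealloc_queue_buffers(ndev, count);
	if (!bufs)
		return -ENOMEM;

	sketch_detach(ndev);			/* tear down old queues */
	err = sketch_attach(ndev, count, bufs);	/* no allocations here */
	if (err)
		sketch_free_queue_buffers(bufs);
	return err;
}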
2024-09-04  PCI/NPEM: Add Native PCIe Enclosure Management support  (Mariusz Tkaczyk, 2 files, -0/+38)

Native PCIe Enclosure Management (NPEM, PCIe r6.1 sec 6.28) allows managing LEDs in storage enclosures. NPEM is indication oriented and does not give direct access to LEDs. Although each indication *could* represent an individual LED, multiple indications could also be represented as a single, multi-color LED or a single LED blinking at a specific interval. The specification leaves that open.

Each enabled indication (capability register bit on) is represented as a led_classdev which can be controlled through sysfs. For every LED class device only 2 brightness states are allowed: LED_ON (1) or LED_OFF (0). This corresponds to the NPEM control register (indication bit on/off).

LED class devices appear in sysfs as child devices (a subdirectory) of a PCI device which has an NPEM Extended Capability and for which the indication is enabled in the NPEM capability register. For example, these are the LEDs created for pcieport "10000:02:05.0" on my setup:

leds/
├── 10000:02:05.0:enclosure:fail
├── 10000:02:05.0:enclosure:locate
├── 10000:02:05.0:enclosure:ok
└── 10000:02:05.0:enclosure:rebuild

They can also be found in the "/sys/class/leds" directory. The parent PCIe device domain/bus/device/function address is used to guarantee uniqueness across the leds subsystem.

To enable/disable a "fail" indication, the "brightness" file can be edited:

echo 1 > ./leds/10000:02:05.0:enclosure:fail/brightness
echo 0 > ./leds/10000:02:05.0:enclosure:fail/brightness

PCIe r6.1, sec 7.9.19.2 defines the possible indications. Multiple indications for the same parent PCIe device can conflict, and hardware may update them when processing a new request. To avoid issues, the driver refreshes all indications by reading back the control register.

This driver expects to be the exclusive NPEM extended capability manager. It waits up to 1 second after issuing a new request, it doesn't verify whether the controller is busy before a write, and it assumes the mutex lock gives protection from concurrent updates.

If _DSM LED management is available, we assume the platform may be using NPEM for its own purposes (see PCI Firmware Spec r3.3 sec 4.7), so the driver does not use NPEM. A future patch will add _DSM support; an info message notes whether NPEM or _DSM is being used.

NPEM is a PCIe extended capability, so it should be registered in pcie_init_capabilities(), but that is not possible due to the LED dependency: the parent pci_device must be added earlier for led_classdev_register() to succeed. NPEM does not require configuration on the kernel side, so it is safe to register the LED devices later.

Link: https://lore.kernel.org/r/[email protected]
Suggested-by: Lukas Wunner <[email protected]>
Signed-off-by: Mariusz Tkaczyk <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Tested-by: Stuart Hayes <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Ilpo Järvinen <[email protected]>
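A minimal sketch of how an enabled indication maps onto an LED class device; the npem_* struct and the register-write helper are illustrative, not the driver's exact symbols:

#include <linux/leds.h>
#include <linux/pci.h>

struct npem_led {
	struct led_classdev cdev;
	struct pci_dev *pdev;
	u32 indication;		/* bit in the NPEM control register */
};

static int npem_led_set(struct led_classdev *cdev, enum led_brightness b)
{
	struct npem_led *nled = container_of(cdev, struct npem_led, cdev);

	/* hypothetical helper: update the control register, wait for
	 * completion, then read back to refresh all indications */
	return npem_write_ctrl(nled->pdev, nled->indication, b == LED_ON);
}

static int npem_register_led(struct npem_led *nled, const char *name)
{
	nled->cdev.name = name;		/* e.g. "10000:02:05.0:enclosure:fail" */
	nled->cdev.max_brightness = 1;	/* only LED_ON / LED_OFF */
	nled->cdev.brightness_set_blocking = npem_led_set;

	return led_classdev_register(&nled->pdev->dev, &nled->cdev);
}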
2024-09-04  Merge branch 'bpf/master' into for-6.12  (Tejun Heo, 51 files, -118/+344)

Pull bpf/master to receive baebe9aaba1e ("bpf: allow passing struct bpf_iter_<type> as kfunc arguments") and related changes in preparation for the DSQ iterator patchset.

Signed-off-by: Tejun Heo <[email protected]>
2024-09-04  sched_ext: Add cgroup support  (Tejun Heo, 1 file, -0/+3)

Add sched_ext_ops operations to init/exit cgroups, and track task migrations and config changes. A BPF scheduler may not implement cgroup features, or implement only a subset of them. The implemented features can be indicated using %SCX_OPS_HAS_CGROUP_* flags. If the cgroup configuration makes use of features that are not implemented, a warning is triggered.

While a BPF scheduler is being enabled and disabled, relevant cgroup operations are locked out using scx_cgroup_rwsem. This avoids situations like task prep taking place while the task is being moved across cgroups, making things easier for BPF schedulers.

v7: - cgroup interface file visibility toggling is dropped in favor of just warning messages. Dynamically changing interface visibility caused more confusion than it helped.

v6: - Updated to reflect the removal of SCX_KF_SLEEPABLE.
    - Updated to use CONFIG_GROUP_SCHED_WEIGHT and fixes for !CONFIG_FAIR_GROUP_SCHED && CONFIG_EXT_GROUP_SCHED.

v5: - Flipped the locking order between scx_cgroup_rwsem and cpus_read_lock() to avoid a locking order conflict w/ cpuset. Better documentation around locking.
    - sched_move_task() takes an early exit if the source and destination are identical. This triggered the warning in scx_cgroup_can_attach() as it left p->scx.cgrp_moving_from uncleared. Updated the cgroup migration path so that ops.cgroup_prep_move() is skipped for identity migrations, so that its invocations always match ops.cgroup_move() one-to-one.

v4: - Example schedulers moved into their own patches.
    - Fix build failure when !CONFIG_CGROUP_SCHED, reported by Andrea Righi.

v3: - Make scx_example_pair switch all tasks by default.
    - Convert to BPF inline iterators.
    - scx_bpf_task_cgroup() is added to determine the current cgroup from the CPU controller's POV. This allows BPF schedulers to accurately track CPU cgroup membership.
    - scx_example_flatcg added. This demonstrates a flattened-hierarchy implementation of CPU cgroup control and shows significant performance improvement when cgroups nested multiple levels deep are under competition.

v2: - Build fixes for different CONFIG combinations.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: David Vernet <[email protected]>
Acked-by: Josh Don <[email protected]>
Acked-by: Hao Luo <[email protected]>
Acked-by: Barret Rhoden <[email protected]>
Reported-by: kernel test robot <[email protected]>
Cc: Andrea Righi <[email protected]>
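As a sketch, a BPF scheduler opting into cgroup tracking might hook up the new operations like this, following the style of the in-tree example schedulers (the myscx_* names are illustrative):

void BPF_STRUCT_OPS(myscx_cgroup_move, struct task_struct *p,
		    struct cgroup *from, struct cgroup *to)
{
	/* update per-cgroup accounting for @p */
}

SEC(".struct_ops.link")
struct sched_ext_ops myscx_ops = {
	.cgroup_move	= (void *)myscx_cgroup_move,
	.flags		= SCX_OPS_HAS_CGROUP_WEIGHT,
	.name		= "myscx",
};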
2024-09-04  sched_ext: TASK_DEAD tasks must be switched into SCX on ops_enable  (Tejun Heo, 1 file, -0/+5)

During scx_ops_enable(), SCX needs to invoke the sleepable ops.init_task() on every task. To do this, it does get_task_struct() on each iterated task, drops the lock and then calls ops.init_task(). However, a TASK_DEAD task may already have lost all its usage count and be waiting for an RCU grace period to be freed. If get_task_struct() is called on such a task, a use-after-free can happen. To avoid such situations, scx_ops_enable() skips initialization of TASK_DEAD tasks, which seems safe as they are never going to be scheduled again.

Unfortunately, a racing sched_setscheduler(2) can grab the task before it is unhashed and then continue to e.g. move the task from RT to SCX after TASK_DEAD is set and ops_enable has skipped the task. As the task hasn't gone through scx_ops_init_task(), scx_ops_enable_task() called from switching_to_scx() triggers the following warning:

sched_ext: Invalid task state transition 0 -> 3 for stress-ng-race-[2872]
WARNING: CPU: 6 PID: 2367 at kernel/sched/ext.c:3327 scx_ops_enable_task+0x18f/0x1f0
...
RIP: 0010:scx_ops_enable_task+0x18f/0x1f0
...
 switching_to_scx+0x13/0xa0
 __sched_setscheduler+0x84e/0xa50
 do_sched_setscheduler+0x104/0x1c0
 __x64_sys_sched_setscheduler+0x18/0x30
 do_syscall_64+0x7b/0x140
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

As in the ops_disable path, it just doesn't seem like a good idea to leave any task in an inconsistent state, even when the task is dead. The root cause is that ops_enable cannot reliably tell whether a task is truly dead (no one else is looking at it and it's about to be freed) and was testing TASK_DEAD instead. Fix it by testing the task's usage count directly.

- ops_init no longer ignores TASK_DEAD tasks. As all users now iterate all tasks, @include_dead is removed from scx_task_iter_next_locked() along with dead-task filtering.

- tryget_task_struct() is added. Tasks are skipped iff tryget_task_struct() fails.

Signed-off-by: Tejun Heo <[email protected]>
Cc: David Vernet <[email protected]>
Cc: Peter Zijlstra <[email protected]>
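A sketch of the new helper, consistent with the description above (task_struct's usage count is a refcount_t):

/* Grab a reference only if the task is not already on its way to being
 * freed; returns NULL if the usage count has already dropped to zero. */
static inline struct task_struct *tryget_task_struct(struct task_struct *t)
{
	return refcount_inc_not_zero(&t->usage) ? t : NULL;
}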
2024-09-04  bpf: Remove the insn_buf array stack usage from the inline_bpf_loop()  (Martin KaFai Lau, 1 file, -1/+1)

This patch removes the insn_buf array stack usage from inline_bpf_loop(). Instead, env->insn_buf is used. The usage in inline_bpf_loop() needs more than 16 insns, so INSN_BUF_SIZE needs to be increased from 16 to 32. The compiler stack size warning on the verifier is gone after this change.

Cc: Eduard Zingerman <[email protected]>
Signed-off-by: Martin KaFai Lau <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Alexei Starovoitov <[email protected]>
2024-09-05  Merge tag 'ib-psy-usb-types-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply into extcon-next  (Chanwoo Choi, 1 file, -2/+1)

Immutable branch for usb_types change for v6.12.

Changing the usb_types type from an array to a bitmap in the power_supply_desc struct requires updating power-supply drivers living in different subsystems, so it is handled via an immutable branch.
2024-09-04  KVM: Add arch hooks for enabling/disabling virtualization  (Sean Christopherson, 1 file, -0/+14)

Add arch hooks that are invoked when KVM enables/disables virtualization. x86 will use the hooks to register an "emergency disable" callback, which is essentially an x86-specific shutdown notifier that is used when the kernel is doing an emergency reboot/shutdown/kexec.

Add comments for the declarations to help arch code understand exactly when the callbacks are invoked. Alternatively, the APIs themselves could communicate most of the same info, but kvm_arch_pre_enable_virtualization() and kvm_arch_post_disable_virtualization() are a bit cumbersome, and make it a bit less obvious that they are intended to be implemented as a pair.

Reviewed-by: Chao Gao <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Acked-by: Kai Huang <[email protected]>
Tested-by: Farrah Chen <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
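A sketch of the pair; the hook names follow the commit, while the __weak no-op defaults are an assumption about how the arch opt-in is wired up:

/* include/linux/kvm_host.h (sketch) */
void kvm_arch_enable_virtualization(void);
void kvm_arch_disable_virtualization(void);

/* virt/kvm/kvm_main.c (sketch): arches override only if they need to */
__weak void kvm_arch_enable_virtualization(void)
{
}

__weak void kvm_arch_disable_virtualization(void)
{
}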
2024-09-04  KVM: Rename arch hooks related to per-CPU virtualization enabling  (Sean Christopherson, 1 file, -2/+2)

Rename the per-CPU hooks used to enable virtualization in hardware to align with the KVM-wide helpers in kvm_main.c, and to better capture that the callbacks are invoked on every online CPU.

No functional change intended.

Suggested-by: Paolo Bonzini <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Kai Huang <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
2024-09-04  fuse: allow idmapped mounts  (Alexander Mikhalitsyn, 1 file, -1/+19)

Now that we have everything in place, we can allow idmapped mounts by setting the FS_ALLOW_IDMAP flag. Notice that real availability of idmapped mounts will depend on the fuse daemon: the daemon has to set the FUSE_ALLOW_IDMAP flag in the FUSE_INIT reply.

To discuss:
- we enable idmapped mounts support only if "default_permissions" mode is enabled, because otherwise we would need to deal with UID/GID mappings on the userspace side OR provide userspace with idmapped req->in.h.uid/req->in.h.gid values, which is not something we probably want to do. The idmapped mounts philosophy is not about faking caller uid/gid.

Some extra links and examples:
- libfuse support: https://github.com/mihalicyn/libfuse/commits/idmap_support
- fuse-overlayfs support: https://github.com/mihalicyn/fuse-overlayfs/commits/idmap_support
- cephfs-fuse conversion example: https://github.com/mihalicyn/ceph/commits/fuse_idmap
- glusterfs conversion example: https://github.com/mihalicyn/glusterfs/commits/fuse_idmap

Signed-off-by: Alexander Mikhalitsyn <[email protected]>
Reviewed-by: Christian Brauner <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
2024-09-04  fuse: add basic infrastructure to support idmappings  (Alexander Mikhalitsyn, 1 file, -0/+2)

Add some preparatory changes in fuse_get_req/fuse_force_creds to handle idmappings. Miklos suggested [1], [2] changing the meaning of the in.h.uid/in.h.gid fields when the daemon declares support for idmapped mounts. In the new semantics, we fill the uid/gid values in the fuse header with the idmapped caller uid/gid (for requests which create new inodes); for all other cases we just send -1 to userspace.

No functional changes intended.

Link: https://lore.kernel.org/all/CAJfpegsVY97_5mHSc06mSw79FehFWtoXT=hhTUK_E-Yhr7OAuQ@mail.gmail.com/ [1]
Link: https://lore.kernel.org/all/CAJfpegtHQsEUuFq1k4ZbTD3E1h-GsrN3PWyv7X8cg6sfU_W2Yw@mail.gmail.com/ [2]
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
2024-09-04  namespace: introduce SB_I_NOIDMAP flag  (Alexander Mikhalitsyn, 1 file, -0/+1)

Right now we determine whether a filesystem supports VFS idmappings based on the presence of the FS_ALLOW_IDMAP flag. This "static" way works perfectly well for local filesystems like ext4, xfs, btrfs, etc. But for network-like filesystems such as fuse and cephfs this approach is not ideal, because proper support of VFS idmaps sometimes requires extensions to the on-wire protocol, which implies that changes have to be made not only in the Linux kernel code but also in 3rd-party components like libfuse, the cephfs MDS server and so on.

We saw that issue during our work on cephfs idmapped mounts [1] with Christian, but right now I'm working on idmapped mounts support for fuse/virtiofs and I think this is the right time for this extension.

[1] 5ccd8530dd7 ("ceph: handle idmapped mounts in create_request_message()")

Suggested-by: Christian Brauner <[email protected]>
Signed-off-by: Alexander Mikhalitsyn <[email protected]>
Reviewed-by: Christian Brauner <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
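A sketch of the intended use, assuming the conventional s_iflags plumbing (the filesystem-side hunk is illustrative):

/* VFS side (sketch): refuse creating an idmapped mount when the
 * superblock has opted out at runtime */
if (sb->s_iflags & SB_I_NOIDMAP)
	return -EINVAL;

/* filesystem side (sketch): start opted out, clear the flag once the
 * peer, e.g. the fuse daemon, confirms idmap support at init time */
sb->s_iflags |= SB_I_NOIDMAP;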
2024-09-04  kvm: Note an RCU quiescent state on guest exit  (Leonardo Bras, 2 files, -3/+13)

As of today, KVM notes a quiescent state only on guest entry, which is good as it avoids the guest being interrupted for current RCU operations.

While the guest vcpu runs, it can be interrupted by a timer IRQ that will check for any RCU operations waiting for this CPU. In case there are any, it invokes rcu_core() in order to sched-out the current thread and note a quiescent state. This occasional scheduler work introduces tens of microseconds of latency, which is really bad for vcpus running latency-sensitive applications, such as real-time workloads.

So, note a quiescent state on guest exit as well, so the interrupted guest is able to deal with any pending RCU operations before being required to invoke rcu_core(), and thus avoid the overhead of the related scheduler work.

Signed-off-by: Leonardo Bras <[email protected]>
Acked-by: Paul E. McKenney <[email protected]>
Acked-by: Sean Christopherson <[email protected]>
Message-ID: <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
2024-09-04  ARM: 9416/1: amba: make amba_bustype constant  (Kunwu Chan, 1 file, -1/+1)

Since commit d492cc2573a0 ("driver core: device.h: make struct bus_type a const *"), the driver core can properly handle a constant struct bus_type, so move the amba_bustype variable to be a constant structure as well, placing it into read-only memory which cannot be modified at runtime.

Cc: Greg Kroah-Hartman <[email protected]>
Suggested-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Kunwu Chan <[email protected]>
Signed-off-by: Russell King (Oracle) <[email protected]>
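The core of the change, as a sketch of the diff:

-struct bus_type amba_bustype = {
+const struct bus_type amba_bustype = {
 	.name		= "amba",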
2024-09-04  printk: nbcon: Show replay message on takeover  (John Ogness, 1 file, -0/+2)

An emergency or panic context can take over console ownership while the current owner is printing a printk message. The atomic printer will re-print the message that the previous owner was printing. However, this can look confusing to the user and may even seem as though a message was lost.

[3430014.1
[3430014.181123] usb 1-2: Product: USB Audio

Add a new field @nbcon_prev_seq to struct console to track the sequence number to print that was assigned to the previous console owner. If this matches the sequence number to print that the current owner is assigned, then a takeover must have occurred. In this case, print an additional message to inform the user that the previous message is being printed again.

[3430014.1
** replaying previous printk message **
[3430014.181123] usb 1-2: Product: USB Audio

Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Petr Mladek <[email protected]>
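A sketch of the detection, assuming @nbcon_prev_seq is updated whenever console ownership changes hands:

/* the previous owner was interrupted while printing this very record,
 * so announce that it is being replayed */
static bool nbcon_should_announce_replay(struct console *con, u64 seq)
{
	return con->nbcon_prev_seq == seq;
}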
2024-09-04  printk: nbcon: Introduce printer kthreads  (Thomas Gleixner, 1 file, -0/+40)

Provide the main implementation for running a printer kthread per nbcon console that is takeover/handover aware. This includes:

- new mandatory write_thread() callback
- kthread creation
- kthread main printing loop
- kthread wakeup mechanism
- kthread shutdown

Kthread creation is a bit tricky because consoles may register before kthreads can be created. In such cases, registration will succeed, even though no kthread exists. Once kthreads can be created, an early_initcall will set @printk_kthreads_ready. If there are no registered boot consoles, the early_initcall creates the kthreads for all registered nbcon consoles. If kthread creation fails, the related console is unregistered.

If there are registered boot consoles when @printk_kthreads_ready is set, no kthreads are created until the final boot console unregisters. Once kthread creation finally occurs, @printk_kthreads_running is set so that the system knows kthreads are available for all registered nbcon consoles.

If @printk_kthreads_running is already set when a console is registering, the kthread is created during registration. If kthread creation fails, the registration will fail. Until @printk_kthreads_running is set, console printing occurs directly via the console_lock.

Kthread shutdown on system shutdown/reboot is necessary to ensure the printer kthreads finish their printing so that the system can cleanly transition back to direct printing via the console_lock in order to reliably push out the final shutdown/reboot messages. @printk_kthreads_running is cleared before shutting down the individual kthreads.

The kthread uses a new mandatory write_thread() callback that is called with both device_lock() and the console context acquired. The console ownership handling is necessary for synchronization against write_atomic(), which is synchronized only via the console context ownership.

The device_lock() serializes acquiring the console context with NBCON_PRIO_NORMAL. It is needed in case the device_lock() does not disable preemption. It prevents the following race:

CPU0				CPU1
[ task A ]

nbcon_context_try_acquire()
  # success with NORMAL prio
  # .unsafe == false;  // safe for takeover

[ schedule: task A -> B ]

				WARN_ON()
				  nbcon_atomic_flush_pending()
				    nbcon_context_try_acquire()
				      # success with EMERGENCY prio
				    # flushing
				    nbcon_context_release()

				    # HERE: con->nbcon_state is free
				    #       to take by anyone !!!

nbcon_context_try_acquire()
  # success with NORMAL prio
  [ task B ]

[ schedule: task B -> A ]

nbcon_enter_unsafe()
  nbcon_context_can_proceed()

BUG: nbcon_context_can_proceed() returns "true" because the console is owned by a context on CPU0 with NBCON_PRIO_NORMAL. But it should return "false": the console is owned by a context from task B, and we do the check in a context from task A.

Note that with these changes, the printer kthreads do not yet take over full responsibility for nbcon printing during normal operation. These changes only focus on the lifecycle of the kthreads.

Co-developed-by: John Ogness <[email protected]>
Signed-off-by: John Ogness <[email protected]>
Signed-off-by: Thomas Gleixner (Intel) <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Petr Mladek <[email protected]>
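A sketch of the kthread main loop implied above; the wakeup condition, the rcuwait field on struct console, and the emit helper are simplified stand-ins for the real implementation:

static int nbcon_kthread_func(void *__console)
{
	struct console *con = __console;

	do {
		/* sleep until this console has records left to print */
		rcuwait_wait_event(&con->rcuwait,
				   nbcon_kthread_should_wakeup(con),
				   TASK_INTERRUPTIBLE);

		/* under device_lock(), acquire the console context with
		 * NBCON_PRIO_NORMAL and invoke the write_thread() callback */
		nbcon_emit_pending_records(con);
	} while (!kthread_should_stop());

	return 0;
}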
2024-09-04  printk: nbcon: Add function for printers to reacquire ownership  (John Ogness, 1 file, -0/+6)

Since ownership can be lost at any time due to handover or takeover, a printing context _must_ be prepared to back out immediately and carefully. However, there are scenarios where the printing context must reacquire ownership in order to finalize or revert hardware changes.

One such example is when interrupts are disabled during printing. No other context will automagically re-enable the interrupts. For this case, the disabling context _must_ reacquire nbcon ownership so that it can re-enable the interrupts.

Provide nbcon_reacquire_nobuf() for exactly this purpose. It allows a printing context to reacquire ownership using the same priority as its previous ownership.

Note that after a successful reacquire the printing context will have no output buffer because that has been lost. This function cannot be used to resume printing.

Signed-off-by: John Ogness <[email protected]>
Reviewed-by: Petr Mladek <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Petr Mladek <[email protected]>
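A sketch of the intended call pattern inside a driver's atomic write callback (the callback body is illustrative):

static void my_write_atomic(struct console *con,
			    struct nbcon_write_context *wctxt)
{
	unsigned long flags;

	local_irq_save(flags);

	if (!nbcon_enter_unsafe(wctxt)) {
		/* ownership was lost, but the IRQ disable still must be
		 * undone: reacquire with the previous priority first */
		nbcon_reacquire_nobuf(wctxt);
		goto out;
	}

	/* ... program the hardware, emit wctxt->outbuf ... */

	nbcon_exit_unsafe(wctxt);
out:
	local_irq_restore(flags);
}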
2024-09-04  firewire: core: add local API to queue work item to workqueue specific to isochronous contexts  (Takashi Sakamoto, 1 file, -0/+1)

In the previous commit, a workqueue was added per instance of the fw_card structure for isochronous contexts. The workqueue is designed to be used by the implementation of the fw_card_driver structure underlying the fw_card. This commit adds some local APIs to be used by that implementation.

Tested-by: Edmund Raile <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Sakamoto <[email protected]>
2024-09-04  firewire: core: allocate workqueue to handle isochronous contexts in card  (Takashi Sakamoto, 1 file, -0/+2)

This commit adds a workqueue dedicated to isochronous context processing. The workqueue is allocated per instance of the fw_card structure to satisfy the following characteristics, which descend from the 1394 OHCI specification:

- In 1394 OHCI, memory pages are reserved for each isochronous context dedicated to DMA transmission, and these per-context pages can be operated on concurrently. Software can schedule hardware interrupts for several isochronous contexts in the same cycle, thus WQ_UNBOUND is specified. Additionally, operating on the contents of the pages may sleep, thus WQ_BH is not used.

- The isochronous context delivers packets with a time stamp, thus WQ_HIGHPRI is specified for semi real-time data such as the IEC 61883-1/6 protocol implemented by the ALSA firewire stack.

- The isochronous context is not used by the implementation of the SCSI over IEEE 1394 protocol (sbp2), thus WQ_MEM_RECLAIM is not specified.

- It is useful for users to adjust the CPU affinity of the workqueue depending on their workloads, thus WQ_SYSFS is specified to expose the attributes to user space.

Tested-by: Edmund Raile <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Takashi Sakamoto <[email protected]>
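A sketch of the allocation matching the flags discussed above (the workqueue name format and the field name are illustrative):

card->isoc_wq = alloc_workqueue("firewire-isoc-card%u",
				WQ_UNBOUND | WQ_HIGHPRI | WQ_SYSFS,
				0, card->index);
if (!card->isoc_wq)
	return -ENOMEM;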
2024-09-04  arm64/ptrace: add support for FEAT_POE  (Joey Gouly, 1 file, -0/+1)

Add a regset for POE containing POR_EL0.

Signed-off-by: Joey Gouly <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Reviewed-by: Mark Brown <[email protected]>
Reviewed-by: Catalin Marinas <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
2024-09-04  arm64: re-order MTE VM_ flags  (Joey Gouly, 1 file, -2/+2)

VM_PKEY_BIT[012] will use VM_HIGH_ARCH_[012], so move the MTE VM flags to accommodate this.

Signed-off-by: Joey Gouly <[email protected]>
Cc: Catalin Marinas <[email protected]>
Cc: Will Deacon <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
2024-09-04  mm: use ARCH_PKEY_BITS to define VM_PKEY_BITN  (Joey Gouly, 1 file, -6/+10)

Use the new CONFIG_ARCH_PKEY_BITS to simplify setting these bits for different architectures.

Signed-off-by: Joey Gouly <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
Acked-by: Dave Hansen <[email protected]>
Reviewed-by: Anshuman Khandual <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>
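A sketch of the resulting definitions, consistent with the description (the exact hunk in include/linux/mm.h may differ):

#ifdef CONFIG_ARCH_HAS_PKEYS
# define VM_PKEY_SHIFT	VM_HIGH_ARCH_BIT_0
# define VM_PKEY_BIT0	VM_HIGH_ARCH_0
# define VM_PKEY_BIT1	VM_HIGH_ARCH_1
# define VM_PKEY_BIT2	VM_HIGH_ARCH_2
# if CONFIG_ARCH_PKEY_BITS > 3
#  define VM_PKEY_BIT3	VM_HIGH_ARCH_3
# else
#  define VM_PKEY_BIT3	0
# endif
# if CONFIG_ARCH_PKEY_BITS > 4
#  define VM_PKEY_BIT4	VM_HIGH_ARCH_4
# else
#  define VM_PKEY_BIT4	0
# endif
#endif /* CONFIG_ARCH_HAS_PKEYS */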
2024-09-04  net: mana: Fix error handling in mana_create_txq/rxq's NAPI cleanup  (Souradeep Chakrabarti, 1 file, -0/+2)

Currently napi_disable() gets called during rxq and txq cleanup even before napi is enabled and the hrtimer is initialized, causing a kernel panic:

? page_fault_oops+0x136/0x2b0
? page_counter_cancel+0x2e/0x80
? do_user_addr_fault+0x2f2/0x640
? refill_obj_stock+0xc4/0x110
? exc_page_fault+0x71/0x160
? asm_exc_page_fault+0x27/0x30
? __mmdrop+0x10/0x180
? __mmdrop+0xec/0x180
? hrtimer_active+0xd/0x50
hrtimer_try_to_cancel+0x2c/0xf0
hrtimer_cancel+0x15/0x30
napi_disable+0x65/0x90
mana_destroy_rxq+0x4c/0x2f0
mana_create_rxq.isra.0+0x56c/0x6d0
? mana_uncfg_vport+0x50/0x50
mana_alloc_queues+0x21b/0x320
? skb_dequeue+0x5f/0x80

Cc: [email protected]
Fixes: e1b5683ff62e ("net: mana: Move NAPI from EQ to CQ")
Signed-off-by: Souradeep Chakrabarti <[email protected]>
Reviewed-by: Haiyang Zhang <[email protected]>
Reviewed-by: Shradha Gupta <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
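A sketch of the guard; the parameter name is illustrative, and the real patch threads the state through the cleanup path:

static void sketch_destroy_rxq(struct mana_port_context *apc,
			       struct mana_rxq *rxq, bool napi_initialized)
{
	if (napi_initialized) {
		/* only tear down NAPI state that was actually set up */
		napi_disable(&rxq->rx_cq.napi);
		netif_napi_del(&rxq->rx_cq.napi);
	}
	/* ... free queue memory ... */
}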
2024-09-04  iommu/amd: Store the nid in io_pgtable_cfg instead of the domain  (Jason Gunthorpe, 1 file, -0/+4)

We already have memory in the union here that is being wasted in AMD's case; use it to store the nid. Putting the nid here further isolates the io_pgtable code from struct protection_domain.

Fix up protection_domain_alloc() so that the NID from the device is provided; at this point dev is never NULL for AMD, so this will now allocate the first table pointer on the correct NUMA node.

Signed-off-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Vasant Hegde <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Joerg Roedel <[email protected]>
2024-09-03  sched_ext: Replace SCX_TASK_BAL_KEEP with SCX_RQ_BAL_KEEP  (Tejun Heo, 1 file, -1/+0)

SCX_TASK_BAL_KEEP is used by balance_one() to tell pick_next_task_scx() to keep running the current task. It's not really a task property. Replace it with SCX_RQ_BAL_KEEP, which resides in rq->scx.flags and is a better fit for the usage.

Also, the existing clearing rule is unnecessarily strict and makes it difficult to use with core-sched. Just clear it on entry to balance_one().

Signed-off-by: Tejun Heo <[email protected]>
2024-09-04  gpio: davinci: drop platform data support  (Bartosz Golaszewski, 1 file, -21/+0)

There are no longer any board files that use the platform data for gpio-davinci. We can remove the header defining it and port the code to no longer store any context in pdata.

Reviewed-by: Linus Walleij <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Bartosz Golaszewski <[email protected]>
2024-09-03  mm: migrate: add isolate_folio_to_list()  (Kefeng Wang, 1 file, -0/+3)

Add the isolate_folio_to_list() helper to try to isolate HugeTLB, non-LRU movable and LRU folios to a list, which will be reused by do_migrate_range() from memory hotplug soon. Also drop mf_isolate_folio(), since the new helper can be used directly in soft_offline_in_use_page().

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Kefeng Wang <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Miaohe Lin <[email protected]>
Tested-by: Miaohe Lin <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Naoya Horiguchi <[email protected]>
Cc: Oscar Salvador <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
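The helper's shape, as a sketch (the non-hugetlb branches are elided):

bool isolate_folio_to_list(struct folio *folio, struct list_head *list)
{
	if (folio_test_hugetlb(folio))
		return isolate_hugetlb(folio, list);

	/* ... isolate a non-LRU movable or LRU folio, then ... */
	list_add_tail(&folio->lru, list);
	return true;
}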
2024-09-03  ipc/shm, mm: drop do_vma_munmap()  (Liam R. Howlett, 1 file, -3/+3)

The do_vma_munmap() wrapper existed for callers that didn't have a vma iterator and needed to check the vma mseal status prior to calling the underlying munmap(). All callers now use a vma iterator, and since the mseal check has been moved to do_vmi_align_munmap() and the vmas are aligned, do_vmi_align_munmap() can just be called instead.

do_vmi_align_munmap() can no longer be static, as ipc/shm is using it, and it is exported via the mm.h header.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Liam R. Howlett <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
Cc: Bert Karwatzki <[email protected]>
Cc: Jeff Xu <[email protected]>
Cc: Jiri Olsa <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Lorenzo Stoakes <[email protected]>
Cc: Mark Brown <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Paul Moore <[email protected]>
Cc: Sidhartha Kumar <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-03  mm: store zero pages to be swapped out in a bitmap  (Usama Arif, 1 file, -0/+1)

Patch series "mm: store zero pages to be swapped out in a bitmap", v8.

As shown in the patch series that introduced the zswap same-filled optimization [1], 10-20% of the pages stored in zswap are same-filled. This is also observed across Meta's server fleet. By using VM counters in swap_writepage (not included in this patch series) it was found that less than 1% of the same-filled pages to be swapped out are non-zero pages.

For a conventional swap setup (without zswap), rather than reading/writing these pages to flash, resulting in increased I/O and flash wear, a bitmap can be used to mark these pages as zero at write time, and the pages can be filled at read time if the bit corresponding to the page is set. When using zswap with swap, this also means that a zswap_entry does not need to be allocated for zero-filled pages, resulting in memory savings which offset the memory used for the bitmap.

A similar attempt was made earlier in [2], where zswap would only track zero-filled pages instead of same-filled ones. This patch series adds the zero-filled pages optimization to swap (hence it can be used even if zswap is disabled) and removes the same-filled code from zswap (as only 1% of the same-filled pages are non-zero), simplifying the code.

[1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/
[2] https://lore.kernel.org/lkml/[email protected]/

This patch (of 2):

Approximately 10-20% of pages to be swapped out are zero pages [1]. Rather than reading/writing these pages to flash, resulting in increased I/O and flash wear, a bitmap can be used to mark these pages as zero at write time, and the pages can be filled at read time if the bit corresponding to the page is set. With this patch, NVMe writes in the Meta server fleet decreased by almost 10% with a conventional swap setup (zswap disabled).

[1] https://lore.kernel.org/all/20171018104832epcms5p1b2232e2236258de3d03d1344dde9fce0@epcms5p1/

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Usama Arif <[email protected]>
Reviewed-by: Chengming Zhou <[email protected]>
Reviewed-by: Yosry Ahmed <[email protected]>
Reviewed-by: Nhat Pham <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Usama Arif <[email protected]>
Cc: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-03  x86: remove PG_uncached  (Matthew Wilcox (Oracle), 3 files, -23/+14)

Convert x86 to use PG_arch_2 instead of PG_uncached and remove PG_uncached.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-03  mm: rename PG_mappedtodisk to PG_owner_2  (Matthew Wilcox (Oracle), 3 files, -10/+18)

This flag has similar constraints to PG_owner_priv_1 -- it is ignored by core code and is entirely for the use of the code which allocated the folio. Since the pagecache does not use it, individual filesystems can use it. The bufferhead code does use it, so filesystems which use the buffer cache must not use it for another purpose.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-03  mm: remove page_has_private()  (Matthew Wilcox (Oracle), 1 file, -10/+5)

This function has no more callers, except folio_has_private(). Combine the two functions.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-03  mm: remove PageOwnerPriv1  (Matthew Wilcox (Oracle), 1 file, -2/+0)

While there are many aliases for this flag, nobody actually uses the *PageOwnerPriv1() nor folio_*_owner_priv_1() accessors. Remove them.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-03  mm: remove PageMlocked  (Matthew Wilcox (Oracle), 1 file, -5/+8)

This flag is now only used on folios, so we can remove all the page accessors.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-03  mm: remove PageUnevictable  (Matthew Wilcox (Oracle), 1 file, -3/+3)

There is only one caller of PageUnevictable() left; convert it to call folio_test_unevictable() and remove all the page accessors.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-03  mm: remove PageSwapCache  (Matthew Wilcox (Oracle), 2 files, -9/+4)

This flag is now only used on folios, so we can remove all the page accessors and reword the comments that refer to them.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-03  mm: remove PageReadahead  (Matthew Wilcox (Oracle), 1 file, -2/+2)

This flag is now only used on folios, so we can remove all the page accessors.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-03  mm: remove PageSwapBacked  (Matthew Wilcox (Oracle), 1 file, -3/+3)

This flag is now only used on folios, so we can remove all the page accessors.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-03  mm: remove PageActive  (Matthew Wilcox (Oracle), 1 file, -2/+3)

Patch series "Simplify the page flags a little".

In the course of our folio conversions, we have made many page flags only used on folios, so we can now remove the page-based accessors. This should cut down compile time a little, and prevent new users from cropping up. There is more that could be done in this area, but it would produce merge conflicts, so I'll sit on those patches until the next merge window. We now have line of sight to removing PG_private_2 and PG_private.

This patch (of 10):

This flag is now only used on folios, so we can remove all the page accessors.

[[email protected]: fix arch/powerpc/mm/pgtable-frag.c]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-03  mm: support only one page_type per page  (Matthew Wilcox (Oracle), 1 file, -40/+28)

By using a few values in the top byte, users of page_type can store up to 24 bits of additional data in page_type. It also reduces code size: with READ_ONCE() replaced by data_race(), the kernel can check just a single byte. E.g.:

ffffffff811e3a79: 8b 47 30             mov    0x30(%rdi),%eax
ffffffff811e3a7c: 55                   push   %rbp
ffffffff811e3a7d: 48 89 e5             mov    %rsp,%rbp
ffffffff811e3a80: 25 00 00 00 82       and    $0x82000000,%eax
ffffffff811e3a85: 3d 00 00 00 80       cmp    $0x80000000,%eax
ffffffff811e3a8a: 74 4d                je     ffffffff811e3ad9 <folio_mapping+0x69>

becomes:

ffffffff811e3a69: 80 7f 33 f5          cmpb   $0xf5,0x33(%rdi)
ffffffff811e3a6d: 55                   push   %rbp
ffffffff811e3a6e: 48 89 e5             mov    %rsp,%rbp
ffffffff811e3a71: 74 4d                je     ffffffff811e3ac0 <folio_mapping+0x60>

replacing three instructions with one.

[[email protected]: fix ubsan warnings]
Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Kent Overstreet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-03  mm: introduce page_mapcount_is_type()  (Matthew Wilcox (Oracle), 2 files, -5/+10)

Resolve the awkward "and add one to this opaque constant" test into a self-documenting inline function.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Kent Overstreet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
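A sketch of the helper pair implied by the description; the constant and the off-by-one handling (the "add one" the commit refers to) may differ from the exact upstream code:

static inline bool page_type_has_type(unsigned int page_type)
{
	return page_type < PAGE_MAPCOUNT_RESERVE;
}

/* @mapcount is one more than page->_mapcount (which starts at -1) */
static inline bool page_mapcount_is_type(unsigned int mapcount)
{
	return page_type_has_type(mapcount - 1);
}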
2024-09-03  printf: remove %pGt support  (Matthew Wilcox (Oracle), 1 file, -10/+0)

Patch series "Increase the number of bits available in page_type".

Kent wants more than 16 bits in page_type, so I resurrected this old patch and expanded it a bit. It's a bit more efficient than our current scheme (1 4-byte insn vs 3 insns of 13 bytes total) to test a single page type.

This patch (of 4):

An upcoming patch will convert page type from being a bitfield to a single byte, so we will not be able to use %pG to print the page type any more. The printing of the symbolic name will be restored in that patch.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Kent Overstreet <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-03  mm: allow read-ahead with IOCB_NOWAIT set  (Yafang Shao, 1 file, -1/+0)

Readahead support for IOCB_NOWAIT was introduced in commit 2e85abf053b9 ("mm: allow read-ahead with IOCB_NOWAIT set"). However, this implementation broke the semantics of IOCB_NOWAIT by potentially causing it to wait on I/O during memory reclamation. This behavior was later modified in commit efa8480a8316 ("fs: RWF_NOWAIT should imply IOCB_NOIO").

To resolve the blocking issue during memory reclamation, we can use memalloc_noio_{save,restore} to ensure non-blocking behavior. This change restores the original functionality, allowing preadv2(IOCB_NOWAIT) to trigger readahead if the file content is not present in the page cache. While this process may trigger direct memory reclamation, the __GFP_NORETRY flag is set in the readahead GFP flags, ensuring it won't block.

A use case for this change is when we want to trigger readahead in the preadv2(2) syscall if the file cache is absent, but without waiting for certain filesystem locks, like xfs_ilock. A simple example is as follows:

retry:
	if (preadv2(fd, iovec, cnt, offset, RWF_NOWAIT) < 0) {
		do_other_work();
		goto retry;
	}

Link: https://lore.gnuweeb.org/io-uring/[email protected]/
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yafang Shao <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Christian Brauner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
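A sketch of the non-blocking readahead guard; its placement within the read path is illustrative:

/* Trigger readahead for an IOCB_NOWAIT read without risking blocking:
 * __GFP_NORETRY is already set for readahead allocations, and the noio
 * scope keeps reclaim from waiting on I/O. */
unsigned int noio_flags = memalloc_noio_save();

page_cache_sync_readahead(mapping, &file->f_ra, file, index, req_count);
memalloc_noio_restore(noio_flags);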
2024-09-03  mm: always inline _compound_head() with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y  (David Hildenbrand, 1 file, -1/+1)

We already force-inline page_fixed_fake_head(), page_is_fake_head() and PageTail(); however, the compiler might decide that _compound_head() is not worth inlining, because of page_fixed_fake_head(). The result is that, for example, PageAnonExclusive() now might involve a function call when checking PageHuge(), which performs a page_folio()->_compound_head() call. This can lead to a slight regression of the stress-ng.clone benchmark.

This is not super-urgent to fix, but always inlining _compound_head() seems like the obvious thing to do for this primitive, similar to the other ones. This change restores the slight regression, and a compilation with CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP=y shows no relevant bloat [2]:

add/remove: 15/14 grow/shrink: 79/87 up/down: 12836/-13917 (-1081)
...
Total: Before=32786363, After=32785282, chg -0.00%

[1] https://lkml.kernel.org/r/[email protected]
[2] https://lore.kernel.org/all/[email protected]/

Link: https://lkml.kernel.org/r/[email protected]
Fixes: c0bff412e67b ("mm: allow anon exclusive check over hugetlb tail pages")
Signed-off-by: David Hildenbrand <[email protected]>
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Tested-by: Yin Fengwei <[email protected]>
Cc: Peter Xu <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-03  err.h: add ERR_PTR_PCPU(), PTR_ERR_PCPU() and IS_ERR_PCPU() macros  (Uros Bizjak, 1 file, -0/+9)

Add ERR_PTR_PCPU(), PTR_ERR_PCPU() and IS_ERR_PCPU() macros that operate on pointers in the percpu address space. These macros remove the need for (__force void *) function argument casts (to avoid sparse -Wcast-from-as warnings). The patch will also avoid future build errors due to pointer address space mismatch with enabled strict percpu address space checks.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Uros Bizjak <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
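A sketch of the macros, consistent with the description (the exact casts upstream may differ):

#define ERR_PTR_PCPU(error)	((void __percpu *)(__force const void *)ERR_PTR(error))
#define PTR_ERR_PCPU(ptr)	(PTR_ERR((__force const void *)(ptr)))
#define IS_ERR_PCPU(ptr)	(IS_ERR((__force const void *)(ptr)))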
2024-09-03  mm: krealloc: clarify valid usage of __GFP_ZERO  (Danilo Krummrich, 1 file, -0/+10)

Properly document that if __GFP_ZERO logic is requested, callers must ensure that, starting with the initial memory allocation, every subsequent call to this API for the same memory allocation is flagged with __GFP_ZERO. Otherwise, it is possible that __GFP_ZERO is not fully honored by this API.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Danilo Krummrich <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Hyeonggon Yoo <[email protected]>
Cc: Joonsoo Kim <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
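A sketch of the documented contract:

void *buf;

/* OK: zeroing requested consistently from the first allocation on;
 * the grown tail is guaranteed to be zeroed */
buf = kzalloc(32, GFP_KERNEL);
buf = krealloc(buf, 64, GFP_KERNEL | __GFP_ZERO);

/* Not guaranteed: the initial allocation did not use __GFP_ZERO, so
 * stale bytes may remain in the previously allocated-but-unused part */
buf = kmalloc(32, GFP_KERNEL);
buf = krealloc(buf, 64, GFP_KERNEL | __GFP_ZERO);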
2024-09-03  mm/rmap: use folio->_mapcount for small folios  (David Hildenbrand, 1 file, -2/+2)

We have some cases left whereby we operate on small folios and still refer to page->_mapcount. Let's just use folio->_mapcount instead, which currently still overlays page->_mapcount, so there is no change. This will make it easier to later spot any remaining users of page->_mapcount that target tail pages.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: David Hildenbrand <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
2024-09-03  mm/hugetlb: use __GFP_COMP for gigantic folios  (Yu Zhao, 1 file, -4/+5)

Use __GFP_COMP for gigantic folios to greatly reduce not only the amount of code but also the allocation and free time.

LOC (approximately): +60, -240

Allocate and free 500 1GB hugeTLB pages without HVO by:

  time echo 500 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
  time echo 0 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

           Before  After
  Alloc    ~13s    ~10s
  Free     ~15s    <1s

The above magnitude generally holds for multiple x86 and arm64 CPU models.

Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Yu Zhao <[email protected]>
Reported-by: Frank van der Linden <[email protected]>
Acked-by: Zi Yan <[email protected]>
Cc: Matthew Wilcox (Oracle) <[email protected]>
Cc: Muchun Song <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>