aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2023-03-24module: move check_modinfo() early to early_mod_check()Luis Chamberlain1-4/+4
This moves check_modinfo() to early_mod_check(). This doesn't make any functional changes either, as check_modinfo() was the first call on layout_and_allocate(), so we're just moving it back one routine and at the end. This let's us keep separate the checkers from the allocator. Signed-off-by: Luis Chamberlain <[email protected]>
2023-03-24module: move early sanity checks into a helperLuis Chamberlain1-17/+26
Move early sanity checkers for the module into a helper. This let's us make it clear when we are working with the local copy of the module prior to allocation. This produces no functional changes, it just makes subsequent changes easier to read. Signed-off-by: Luis Chamberlain <[email protected]>
2023-03-24module: add a for_each_modinfo_entry()Luis Chamberlain2-4/+4
Add a for_each_modinfo_entry() to make it easier to read and use. This produces no functional changes but makes this code easiert to read as we are used to with loops in the kernel and trims more lines of code. Signed-off-by: Luis Chamberlain <[email protected]>
2023-03-24module: rename next_string() to module_next_tag_pair()Luis Chamberlain2-3/+5
This makes it clearer what it is doing. While at it, make it available to other code other than main.c. This will be used in the subsequent patch and make the changes easier to read. Signed-off-by: Luis Chamberlain <[email protected]>
2023-03-24module: move get_modinfo() helpers all aboveLuis Chamberlain1-52/+48
Instead of forward declaring routines for get_modinfo() just move everything up. This makes no functional changes. Signed-off-by: Luis Chamberlain <[email protected]>
2023-03-24Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski8-21/+20
Conflicts: drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 6e9d51b1a5cb ("net/mlx5e: Initialize link speed to zero") 1bffcea42926 ("net/mlx5e: Add devlink hairpin queues parameters") https://lore.kernel.org/all/[email protected]/ https://lore.kernel.org/all/[email protected]/ Adjacent changes: drivers/net/phy/phy.c 323fe43cf9ae ("net: phy: Improved PHY error reporting in state machine") 4203d84032e2 ("net: phy: Ensure state transitions are processed from phy_stop()") Signed-off-by: Jakub Kicinski <[email protected]>
2023-03-24kernel/ksysfs.c: use sysfs_emit for sysfs show handlersThomas Weißschuh1-11/+11
sysfs_emit() is the recommended way to format strings for sysfs as per Documentation/filesystems/sysfs.rst. Signed-off-by: Thomas Weißschuh <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]>
2023-03-24Merge tag 'net-6.3-rc4' of ↵Linus Torvalds2-2/+11
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf, wifi and bluetooth. Current release - regressions: - wifi: mt76: mt7915: add back 160MHz channel width support for MT7915 - libbpf: revert poisoning of strlcpy, it broke uClibc-ng Current release - new code bugs: - bpf: improve the coverage of the "allow reads from uninit stack" feature to fix verification complexity problems - eth: am65-cpts: reset PPS genf adj settings on enable Previous releases - regressions: - wifi: mac80211: serialize ieee80211_handle_wake_tx_queue() - wifi: mt76: do not run mt76_unregister_device() on unregistered hw, fix null-deref - Bluetooth: btqcomsmd: fix command timeout after setting BD address - eth: igb: revert rtnl_lock() that causes a deadlock - dsa: mscc: ocelot: fix device specific statistics Previous releases - always broken: - xsk: add missing overflow check in xdp_umem_reg() - wifi: mac80211: - fix QoS on mesh interfaces - fix mesh path discovery based on unicast packets - Bluetooth: - ISO: fix timestamped HCI ISO data packet parsing - remove "Power-on" check from Mesh feature - usbnet: more fixes to drivers trusting packet length - wifi: iwlwifi: mvm: fix mvmtxq->stopped handling - Bluetooth: btintel: iterate only bluetooth device ACPI entries - eth: iavf: fix inverted Rx hash condition leading to disabled hash - eth: igc: fix the validation logic for taprio's gate list - dsa: tag_brcm: legacy: fix daisy-chained switches Misc: - bpf: adjust insufficient default bpf_jit_limit to account for growth of BPF use over the last 5 years - xdp: bpf_xdp_metadata() use EOPNOTSUPP as unique errno indicating no driver support" * tag 'net-6.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits) Bluetooth: HCI: Fix global-out-of-bounds Bluetooth: mgmt: Fix MGMT add advmon with RSSI command Bluetooth: btsdio: fix use after free bug in btsdio_remove due to unfinished work Bluetooth: L2CAP: Fix responding with wrong PDU type Bluetooth: btqcomsmd: Fix command timeout after setting BD address Bluetooth: btinel: Check ACPI handle for NULL before accessing net: mdio: thunder: Add missing fwnode_handle_put() net: dsa: mt7530: move setting ssc_delta to PHY_INTERFACE_MODE_TRGMII case net: dsa: mt7530: move lowering TRGMII driving to mt7530_setup() net: dsa: mt7530: move enabling disabling core clock to mt7530_pll_setup() net: asix: fix modprobe "sysfs: cannot create duplicate filename" gve: Cache link_speed value from device tools: ynl: Fix genlmsg header encoding formats net: enetc: fix aggregate RMON counters not showing the ranges Bluetooth: Remove "Power-on" check from Mesh feature Bluetooth: Fix race condition in hci_cmd_sync_clear Bluetooth: btintel: Iterate only bluetooth device ACPI entries Bluetooth: ISO: fix timestamped HCI ISO data packet parsing Bluetooth: btusb: Remove detection of ISO packets over bulk Bluetooth: hci_core: Detect if an ACL packet is in fact an ISO packet ...
2023-03-24trace,smp: Trace all smp_function_call*() invocationsPeter Zijlstra1-30/+36
(Ab)use the trace_ipi_send_cpu*() family to trace all smp_function_call*() invocations, not only those that result in an actual IPI. The queued entries log their callback function while the actual IPIs are traced on generic_smp_call_function_single_interrupt(). Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
2023-03-24trace: Add trace_ipi_send_cpu()Peter Zijlstra3-6/+5
Because copying cpumasks around when targeting a single CPU is a bit daft... Tested-and-reviewed-by: Valentin Schneider <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/20230322103004.GA571242%40hirez.programming.kicks-ass.net
2023-03-24sched, smp: Trace smp callback causing an IPIValentin Schneider3-16/+53
Context ======= The newly-introduced ipi_send_cpumask tracepoint has a "callback" parameter which so far has only been fed with NULL. While CSD_TYPE_SYNC/ASYNC and CSD_TYPE_IRQ_WORK share a similar backing struct layout (meaning their callback func can be accessed without caring about the actual CSD type), CSD_TYPE_TTWU doesn't even have a function attached to its struct. This means we need to check the type of a CSD before eventually dereferencing its associated callback. This isn't as trivial as it sounds: the CSD type is stored in __call_single_node.u_flags, which get cleared right before the callback is executed via csd_unlock(). This implies checking the CSD type before it is enqueued on the call_single_queue, as the target CPU's queue can be flushed before we get to sending an IPI. Furthermore, send_call_function_single_ipi() only has a CPU parameter, and would need to have an additional argument to trickle down the invoked function. This is somewhat silly, as the extra argument will always be pushed down to the function even when nothing is being traced, which is unnecessary overhead. Changes ======= send_call_function_single_ipi() is only used by smp.c, and is defined in sched/core.c as it contains scheduler-specific ops (set_nr_if_polling() of a CPU's idle task). Split it into two parts: the scheduler bits remain in sched/core.c, and the actual IPI emission is moved into smp.c. This lets us define an __always_inline helper function that can take the related callback as parameter without creating useless register pressure in the non-traced path which only gains a (disabled) static branch. Do the same thing for the multi IPI case. Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-03-24smp: reword smp call IPI commentValentin Schneider1-3/+4
Accessing the call_single_queue hasn't involved a spinlock since 2014: 6897fc22ea01 ("kernel: use lockless list for smp_call_function_single") The llist operations (namely cmpxchg() and xchg()) provide similar ordering guarantees, update the comment to lessen confusion. Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-03-24irq_work: Trace self-IPIs sent via arch_irq_work_raise()Valentin Schneider1-1/+13
IPIs sent to remote CPUs via irq_work_queue_on() are now covered by trace_ipi_send_cpumask(), add another instance of the tracepoint to cover self-IPIs. Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-03-24smp: Trace IPIs sent via arch_send_call_function_ipi_mask()Valentin Schneider1-1/+8
This simply wraps around the arch function and prepends it with a tracepoint, similar to send_call_function_single_ipi(). Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-03-24sched, smp: Trace IPIs sent via send_call_function_single_ipi()Valentin Schneider2-2/+9
send_call_function_single_ipi() is the thing that sends IPIs at the bottom of smp_call_function*() via either generic_exec_single() or smp_call_function_many_cond(). Give it an IPI-related tracepoint. Note that this ends up tracing any IPI sent via __smp_call_single_queue(), which covers __ttwu_queue_wakelist() and irq_work_queue_on() "for free". Signed-off-by: Valentin Schneider <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Steven Rostedt (Google) <[email protected]> Acked-by: Ingo Molnar <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-03-24kernel/smp: Make csdlock_debug= resettablePaul E. McKenney1-3/+8
It is currently possible to set the csdlock_debug_enabled static branch, but not to reset it. This is an issue when several different entities supply kernel boot parameters and also for kernels built with CONFIG_CSD_LOCK_WAIT_DEBUG_DEFAULT=y. Therefore, make the csdlock_debug=0 kernel boot parameter turn off debugging. Last one wins! Reported-by: Jes Sorensen <[email protected]> Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Juergen Gross <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-03-24locking/csd_lock: Remove per-CPU data indirection from CSD lock debuggingPaul E. McKenney1-10/+6
The diagnostics added by this commit were extremely useful in one instance: a5aabace5fb8 ("locking/csd_lock: Add more data to CSD lock debugging") However, they have not seen much action since, and there have been some concerns expressed that the complexity is not worth the benefit. Therefore, manually revert the following commit preparatory commit: de7b09ef658d ("locking/csd_lock: Prepare more CSD lock debugging") Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Juergen Gross <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-03-24locking/csd_lock: Remove added data from CSD lock debuggingPaul E. McKenney1-221/+12
The diagnostics added by this commit were extremely useful in one instance: a5aabace5fb8 ("locking/csd_lock: Add more data to CSD lock debugging") However, they have not seen much action since, and there have been some concerns expressed that the complexity is not worth the benefit. Therefore, manually revert this commit, but leave a comment telling people where to find these diagnostics. [ paulmck: Apply Juergen Gross feedback. ] Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Juergen Gross <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-03-24locking/csd_lock: Add Kconfig option for csd_debug defaultPaul E. McKenney1-1/+1
The csd_debug kernel parameter works well, but is inconvenient in cases where it is more closely associated with boot loaders or automation than with a particular kernel version or release. Thererfore, provide a new CSD_LOCK_WAIT_DEBUG_DEFAULT Kconfig option that defaults csd_debug to 1 when selected and 0 otherwise, with this latter being the default. Signed-off-by: Paul E. McKenney <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Acked-by: Juergen Gross <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-03-23cpuset: Clean up cpuset_node_allowedHaifeng Xu1-2/+2
Commit 002f290627c2 ("cpuset: use static key better and convert to new API") has used __cpuset_node_allowed() instead of cpuset_node_allowed() to check whether we can allocate on a memory node. Now this function isn't used by anyone, so we can do the follow things to clean up it. 1. remove unused codes 2. rename __cpuset_node_allowed() to cpuset_node_allowed() 3. update comments in mm/page_alloc.c Suggested-by: Waiman Long <[email protected]> Signed-off-by: Haifeng Xu <[email protected]> Acked-by: Waiman Long <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2023-03-23workqueue: Introduce show_freezable_workqueuesJungseung Lee2-3/+25
Currently show_all_workqueue is called if freeze fails at the time of freeze the workqueues, which shows the status of all workqueues and of all worker pools. In this cases we may only need to dump state of only workqueues that are freezable and busy. This patch defines show_freezable_workqueues, which uses show_one_workqueue, a granular function that shows the state of individual workqueues, so that dump only the state of freezable workqueues at that time. tj: Minor message adjustment. Signed-off-by: Jungseung Lee <[email protected]> Signed-off-by: Tejun Heo <[email protected]>
2023-03-23kcsan: avoid passing -g for testMarco Elver1-1/+1
Nathan reported that when building with GNU as and a version of clang that defaults to DWARF5, the assembler will complain with: Error: non-constant .uleb128 is not supported This is because `-g` defaults to the compiler debug info default. If the assembler does not support some of the directives used, the above errors occur. To fix, remove the explicit passing of `-g`. All the test wants is that stack traces print valid function names, and debug info is not required for that. (I currently cannot recall why I added the explicit `-g`.) Link: https://lkml.kernel.org/r/[email protected] Fixes: 1fe84fd4a402 ("kcsan: Add test suite") Signed-off-by: Marco Elver <[email protected]> Reported-by: Nathan Chancellor <[email protected]> Cc: Alexander Potapenko <[email protected]> Cc: Dmitry Vyukov <[email protected]> Cc: <[email protected]> Signed-off-by: Andrew Morton <[email protected]>
2023-03-23Merge tag 'for-netdev' of ↵Jakub Kicinski2-2/+11
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2023-03-23 We've added 8 non-merge commits during the last 13 day(s) which contain a total of 21 files changed, 238 insertions(+), 161 deletions(-). The main changes are: 1) Fix verification issues in some BPF programs due to their stack usage patterns, from Eduard Zingerman. 2) Fix to add missing overflow checks in xdp_umem_reg and return an error in such case, from Kal Conley. 3) Fix and undo poisoning of strlcpy in libbpf given it broke builds for libcs which provided the former like uClibc-ng, from Jesus Sanchez-Palencia. 4) Fix insufficient bpf_jit_limit default to avoid users running into hard to debug seccomp BPF errors, from Daniel Borkmann. 5) Fix driver return code when they don't support a bpf_xdp_metadata kfunc to make it unambiguous from other errors, from Jesper Dangaard Brouer. 6) Two BPF selftest fixes to address compilation errors from recent changes in kernel structures, from Alexei Starovoitov. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: xdp: bpf_xdp_metadata use EOPNOTSUPP for no driver support bpf: Adjust insufficient default bpf_jit_limit xsk: Add missing overflow check in xdp_umem_reg selftests/bpf: Fix progs/test_deny_namespace.c issues. selftests/bpf: Fix progs/find_vma_fail1.c build error. libbpf: Revert poisoning of strlcpy selftests/bpf: Tests for uninitialized stack reads bpf: Allow reads from uninit stack ==================== Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jakub Kicinski <[email protected]>
2023-03-23vhost_task: Allow vhost layer to use copy_processMike Christie2-0/+118
Qemu will create vhost devices in the kernel which perform network, SCSI, etc IO and management operations from worker threads created by the kthread API. Because the kthread API does a copy_process on the kthreadd thread, the vhost layer has to use kthread_use_mm to access the Qemu thread's memory and cgroup_attach_task_all to add itself to the Qemu thread's cgroups, and it bypasses the RLIMIT_NPROC limit which can result in VMs creating more threads than the admin expected. This patch adds a new struct vhost_task which can be used instead of kthreads. They allow the vhost layer to use copy_process and inherit the userspace process's mm and cgroups, the task is accounted for under the userspace's nproc count and can be seen in its process tree, and other features like namespaces work and are inherited by default. Signed-off-by: Mike Christie <[email protected]> Acked-by: Michael S. Tsirkin <[email protected]> Signed-off-by: Christian Brauner (Microsoft) <[email protected]> Signed-off-by: Christian Brauner <[email protected]>
2023-03-22bpf: Update the struct_ops of a bpf_link.Kui-Feng Lee2-1/+81
By improving the BPF_LINK_UPDATE command of bpf(), it should allow you to conveniently switch between different struct_ops on a single bpf_link. This would enable smoother transitions from one struct_ops to another. The struct_ops maps passing along with BPF_LINK_UPDATE should have the BPF_F_LINK flag. Signed-off-by: Kui-Feng Lee <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-03-22bpf: Create links for BPF struct_ops maps.Kui-Feng Lee2-11/+155
Make bpf_link support struct_ops. Previously, struct_ops were always used alone without any associated links. Upon updating its value, a struct_ops would be activated automatically. Yet other BPF program types required to make a bpf_link with their instances before they could become active. Now, however, you can create an inactive struct_ops, and create a link to activate it later. With bpf_links, struct_ops has a behavior similar to other BPF program types. You can pin/unpin them from their links and the struct_ops will be deactivated when its link is removed while previously need someone to delete the value for it to be deactivated. bpf_links are responsible for registering their associated struct_ops. You can only use a struct_ops that has the BPF_F_LINK flag set to create a bpf_link, while a structs without this flag behaves in the same manner as before and is registered upon updating its value. The BPF_LINK_TYPE_STRUCT_OPS serves a dual purpose. Not only is it used to craft the links for BPF struct_ops programs, but also to create links for BPF struct_ops them-self. Since the links of BPF struct_ops programs are only used to create trampolines internally, they are never seen in other contexts. Thus, they can be reused for struct_ops themself. To maintain a reference to the map supporting this link, we add bpf_struct_ops_link as an additional type. The pointer of the map is RCU and won't be necessary until later in the patchset. Signed-off-by: Kui-Feng Lee <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-03-22bpf: Retire the struct_ops map kvalue->refcnt.Kui-Feng Lee2-35/+48
We have replaced kvalue-refcnt with synchronize_rcu() to wait for an RCU grace period. Maintenance of kvalue->refcnt was a complicated task, as we had to simultaneously keep track of two reference counts: one for the reference count of bpf_map. When the kvalue->refcnt reaches zero, we also have to reduce the reference count on bpf_map - yet these steps are not performed in an atomic manner and require us to be vigilant when managing them. By eliminating kvalue->refcnt, we can make our maintenance more straightforward as the refcount of bpf_map is now solely managed! To prevent the trampoline image of a struct_ops from being released while it is still in use, we wait for an RCU grace period. The setsockopt(TCP_CONGESTION, "...") command allows you to change your socket's congestion control algorithm and can result in releasing the old struct_ops implementation. It is fine. However, this function is exposed through bpf_setsockopt(), it may be accessed by BPF programs as well. To ensure that the trampoline image belonging to struct_op can be safely called while its method is in use, the trampoline safeguarde the BPF program with rcu_read_lock(). Doing so prevents any destruction of the associated images before returning from a trampoline and requires us to wait for an RCU grace period. Signed-off-by: Kui-Feng Lee <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-03-22bpf: remember meta->iter info only for initialized itersAndrii Nakryiko1-7/+7
For iter_new() functions iterator state's slot might not be yet initialized, in which case iter_get_spi() will return -ERANGE. This is expected and is handled properly. But for iter_next() and iter_destroy() cases iter slot is supposed to be initialized and correct, so -ERANGE is not possible. Move meta->iter.{spi,frameno} initialization into iter_next/iter_destroy handling branch to make it more explicit that valid information will be remembered in meta->iter block for subsequent use in process_iter_next_call(), avoiding confusingly looking -ERANGE assignment for meta->iter.spi. Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Martin KaFai Lau <[email protected]>
2023-03-22bpf: Fix __reg_bound_offset 64->32 var_off subreg propagationDaniel Borkmann1-3/+3
Xu reports that after commit 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking"), the following BPF program is rejected by the verifier: 0: (61) r2 = *(u32 *)(r1 +0) ; R2_w=pkt(off=0,r=0,imm=0) 1: (61) r3 = *(u32 *)(r1 +4) ; R3_w=pkt_end(off=0,imm=0) 2: (bf) r1 = r2 3: (07) r1 += 1 4: (2d) if r1 > r3 goto pc+8 5: (71) r1 = *(u8 *)(r2 +0) ; R1_w=scalar(umax=255,var_off=(0x0; 0xff)) 6: (18) r0 = 0x7fffffffffffff10 8: (0f) r1 += r0 ; R1_w=scalar(umin=0x7fffffffffffff10,umax=0x800000000000000f) 9: (18) r0 = 0x8000000000000000 11: (07) r0 += 1 12: (ad) if r0 < r1 goto pc-2 13: (b7) r0 = 0 14: (95) exit And the verifier log says: func#0 @0 0: R1=ctx(off=0,imm=0) R10=fp0 0: (61) r2 = *(u32 *)(r1 +0) ; R1=ctx(off=0,imm=0) R2_w=pkt(off=0,r=0,imm=0) 1: (61) r3 = *(u32 *)(r1 +4) ; R1=ctx(off=0,imm=0) R3_w=pkt_end(off=0,imm=0) 2: (bf) r1 = r2 ; R1_w=pkt(off=0,r=0,imm=0) R2_w=pkt(off=0,r=0,imm=0) 3: (07) r1 += 1 ; R1_w=pkt(off=1,r=0,imm=0) 4: (2d) if r1 > r3 goto pc+8 ; R1_w=pkt(off=1,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) 5: (71) r1 = *(u8 *)(r2 +0) ; R1_w=scalar(umax=255,var_off=(0x0; 0xff)) R2_w=pkt(off=0,r=1,imm=0) 6: (18) r0 = 0x7fffffffffffff10 ; R0_w=9223372036854775568 8: (0f) r1 += r0 ; R0_w=9223372036854775568 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775823,s32_min=-240,s32_max=15) 9: (18) r0 = 0x8000000000000000 ; R0_w=-9223372036854775808 11: (07) r0 += 1 ; R0_w=-9223372036854775807 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775809) 13: (b7) r0 = 0 ; R0_w=0 14: (95) exit from 12 to 11: R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775810,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775806 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775806 R1_w=scalar(umin=9223372036854775810,umax=9223372036854775810,var_off=(0x8000000000000000; 0xffffffff)) 13: safe [...] from 12 to 11: R0_w=-9223372036854775795 R1=scalar(umin=9223372036854775822,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775794 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775794 R1=scalar(umin=9223372036854775822,umax=9223372036854775822,var_off=(0x8000000000000000; 0xffffffff)) 13: safe from 12 to 11: R0_w=-9223372036854775794 R1=scalar(umin=9223372036854775823,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775793 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775793 R1=scalar(umin=9223372036854775823,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) 13: safe from 12 to 11: R0_w=-9223372036854775793 R1=scalar(umin=9223372036854775824,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775792 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775792 R1=scalar(umin=9223372036854775824,umax=9223372036854775823,var_off=(0x8000000000000000; 0xffffffff)) 13: safe [...] The 64bit umin=9223372036854775810 bound continuously bumps by +1 while umax=9223372036854775823 stays as-is until the verifier complexity limit is reached and the program gets finally rejected. During this simulation, the umin also eventually surpasses umax. Looking at the first 'from 12 to 11' output line from the loop, R1 has the following state: R1_w=scalar(umin=0x8000000000000002 (9223372036854775810), umax=0x800000000000000f (9223372036854775823), var_off=(0x8000000000000000; 0xffffffff)) The var_off has technically not an inconsistent state but it's very imprecise and far off surpassing 64bit umax bounds whereas the expected output with refined known bits in var_off should have been like: R1_w=scalar(umin=0x8000000000000002 (9223372036854775810), umax=0x800000000000000f (9223372036854775823), var_off=(0x8000000000000000; 0xf)) In the above log, var_off stays as var_off=(0x8000000000000000; 0xffffffff) and does not converge into a narrower mask where more bits become known, eventually transforming R1 into a constant upon umin=9223372036854775823, umax=9223372036854775823 case where the verifier would have terminated and let the program pass. The __reg_combine_64_into_32() marks the subregister unknown and propagates 64bit {s,u}min/{s,u}max bounds to their 32bit equivalents iff they are within the 32bit universe. The question came up whether __reg_combine_64_into_32() should special case the situation that when 64bit {s,u}min bounds have the same value as 64bit {s,u}max bounds to then assign the latter as well to the 32bit reg->{s,u}32_{min,max}_value. As can be seen from the above example however, that is just /one/ special case and not a /generic/ solution given above example would still not be addressed this way and remain at an imprecise var_off=(0x8000000000000000; 0xffffffff). The improvement is needed in __reg_bound_offset() to refine var32_off with the updated var64_off instead of the prior reg->var_off. The reg_bounds_sync() code first refines information about the register's min/max bounds via __update_reg_bounds() from the current var_off, then in __reg_deduce_bounds() from sign bit and with the potentially learned bits from bounds it'll update the var_off tnum in __reg_bound_offset(). For example, intersecting with the old var_off might have improved bounds slightly, e.g. if umax was 0x7f...f and var_off was (0; 0xf...fc), then new var_off will then result in (0; 0x7f...fc). The intersected var64_off holds then the universe which is a superset of var32_off. The point for the latter is not to broaden, but to further refine known bits based on the intersection of var_off with 32 bit bounds, so that we later construct the final var_off from upper and lower 32 bits. The final __update_reg_bounds() can then potentially still slightly refine bounds if more bits became known from the new var_off. After the improvement, we can see R1 converging successively: func#0 @0 0: R1=ctx(off=0,imm=0) R10=fp0 0: (61) r2 = *(u32 *)(r1 +0) ; R1=ctx(off=0,imm=0) R2_w=pkt(off=0,r=0,imm=0) 1: (61) r3 = *(u32 *)(r1 +4) ; R1=ctx(off=0,imm=0) R3_w=pkt_end(off=0,imm=0) 2: (bf) r1 = r2 ; R1_w=pkt(off=0,r=0,imm=0) R2_w=pkt(off=0,r=0,imm=0) 3: (07) r1 += 1 ; R1_w=pkt(off=1,r=0,imm=0) 4: (2d) if r1 > r3 goto pc+8 ; R1_w=pkt(off=1,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) 5: (71) r1 = *(u8 *)(r2 +0) ; R1_w=scalar(umax=255,var_off=(0x0; 0xff)) R2_w=pkt(off=0,r=1,imm=0) 6: (18) r0 = 0x7fffffffffffff10 ; R0_w=9223372036854775568 8: (0f) r1 += r0 ; R0_w=9223372036854775568 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775823,s32_min=-240,s32_max=15) 9: (18) r0 = 0x8000000000000000 ; R0_w=-9223372036854775808 11: (07) r0 += 1 ; R0_w=-9223372036854775807 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775568,umax=9223372036854775809) 13: (b7) r0 = 0 ; R0_w=0 14: (95) exit from 12 to 11: R0_w=-9223372036854775807 R1_w=scalar(umin=9223372036854775810,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775806 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775806 R1_w=-9223372036854775806 13: safe from 12 to 11: R0_w=-9223372036854775806 R1_w=scalar(umin=9223372036854775811,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775805 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775805 R1_w=-9223372036854775805 13: safe [...] from 12 to 11: R0_w=-9223372036854775798 R1=scalar(umin=9223372036854775819,umax=9223372036854775823,var_off=(0x8000000000000008; 0x7),s32_min=8,s32_max=15,u32_min=8,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775797 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775797 R1=-9223372036854775797 13: safe from 12 to 11: R0_w=-9223372036854775797 R1=scalar(umin=9223372036854775820,umax=9223372036854775823,var_off=(0x800000000000000c; 0x3),s32_min=12,s32_max=15,u32_min=12,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775796 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775796 R1=-9223372036854775796 13: safe from 12 to 11: R0_w=-9223372036854775796 R1=scalar(umin=9223372036854775821,umax=9223372036854775823,var_off=(0x800000000000000c; 0x3),s32_min=12,s32_max=15,u32_min=12,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775795 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775795 R1=-9223372036854775795 13: safe from 12 to 11: R0_w=-9223372036854775795 R1=scalar(umin=9223372036854775822,umax=9223372036854775823,var_off=(0x800000000000000e; 0x1),s32_min=14,s32_max=15,u32_min=14,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775794 12: (ad) if r0 < r1 goto pc-2 ; R0_w=-9223372036854775794 R1=-9223372036854775794 13: safe from 12 to 11: R0_w=-9223372036854775794 R1=-9223372036854775793 R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 11: (07) r0 += 1 ; R0_w=-9223372036854775793 12: (ad) if r0 < r1 goto pc-2 last_idx 12 first_idx 12 parent didn't have regs=1 stack=0 marks: R0_rw=P-9223372036854775801 R1_r=scalar(umin=9223372036854775815,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 last_idx 11 first_idx 11 regs=1 stack=0 before 11: (07) r0 += 1 parent didn't have regs=1 stack=0 marks: R0_rw=P-9223372036854775805 R1_rw=scalar(umin=9223372036854775812,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 last_idx 12 first_idx 0 regs=1 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=1 stack=0 before 11: (07) r0 += 1 regs=1 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=1 stack=0 before 11: (07) r0 += 1 regs=1 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=1 stack=0 before 11: (07) r0 += 1 regs=1 stack=0 before 9: (18) r0 = 0x8000000000000000 last_idx 12 first_idx 12 parent didn't have regs=2 stack=0 marks: R0_rw=P-9223372036854775801 R1_r=Pscalar(umin=9223372036854775815,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2=pkt(off=0,r=1,imm=0) R3=pkt_end(off=0,imm=0) R10=fp0 last_idx 11 first_idx 11 regs=2 stack=0 before 11: (07) r0 += 1 parent didn't have regs=2 stack=0 marks: R0_rw=P-9223372036854775805 R1_rw=Pscalar(umin=9223372036854775812,umax=9223372036854775823,var_off=(0x8000000000000000; 0xf),s32_min=0,s32_max=15,u32_max=15) R2_w=pkt(off=0,r=1,imm=0) R3_w=pkt_end(off=0,imm=0) R10=fp0 last_idx 12 first_idx 0 regs=2 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=2 stack=0 before 11: (07) r0 += 1 regs=2 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=2 stack=0 before 11: (07) r0 += 1 regs=2 stack=0 before 12: (ad) if r0 < r1 goto pc-2 regs=2 stack=0 before 11: (07) r0 += 1 regs=2 stack=0 before 9: (18) r0 = 0x8000000000000000 regs=2 stack=0 before 8: (0f) r1 += r0 regs=3 stack=0 before 6: (18) r0 = 0x7fffffffffffff10 regs=2 stack=0 before 5: (71) r1 = *(u8 *)(r2 +0) 13: safe from 4 to 13: safe verification time 322 usec stack depth 0 processed 56 insns (limit 1000000) max_states_per_insn 1 total_states 3 peak_states 3 mark_read 1 This also fixes up a test case along with this improvement where we match on the verifier log. The updated log now has a refined var_off, too. Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking") Reported-by: Xu Kuohai <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Reviewed-by: John Fastabend <[email protected]> Link: https://lore.kernel.org/bpf/[email protected] Link: https://lore.kernel.org/bpf/[email protected]
2023-03-22module/decompress: Never use kunmap() for local un-mappingsFabio M. De Francesco1-1/+1
Use kunmap_local() to unmap pages locally mapped with kmap_local_page(). kunmap_local() must be called on the kernel virtual address returned by kmap_local_page(), differently from how we use kunmap() which instead expects the mapped page as its argument. In module_zstd_decompress() we currently map with kmap_local_page() and unmap with kunmap(). This breaks the code and so it should be fixed. Cc: Piotr Gorski <[email protected]> Cc: Dmitry Torokhov <[email protected]> Cc: Luis Chamberlain <[email protected]> Cc: Stephen Boyd <[email protected]> Cc: Ira Weiny <[email protected]> Fixes: 169a58ad824d ("module/decompress: Support zstd in-kernel decompression") Signed-off-by: Fabio M. De Francesco <[email protected]> Reviewed-by: Stephen Boyd <[email protected]> Reviewed-by: Ira Weiny <[email protected]> Reviewed-by: Piotr Gorski <[email protected]> Signed-off-by: Luis Chamberlain <[email protected]>
2023-03-22bpf: return long from bpf_map_ops funcsJP Kobryn16-89/+89
This patch changes the return types of bpf_map_ops functions to long, where previously int was returned. Using long allows for bpf programs to maintain the sign bit in the absence of sign extension during situations where inlined bpf helper funcs make calls to the bpf_map_ops funcs and a negative error is returned. The definitions of the helper funcs are generated from comments in the bpf uapi header at `include/uapi/linux/bpf.h`. The return type of these helpers was previously changed from int to long in commit bdb7b79b4ce8. For any case where one of the map helpers call the bpf_map_ops funcs that are still returning 32-bit int, a compiler might not include sign extension instructions to properly convert the 32-bit negative value a 64-bit negative value. For example: bpf assembly excerpt of an inlined helper calling a kernel function and checking for a specific error: ; err = bpf_map_update_elem(&mymap, &key, &val, BPF_NOEXIST); ... 46: call 0xffffffffe103291c ; htab_map_update_elem ; if (err && err != -EEXIST) { 4b: cmp $0xffffffffffffffef,%rax ; cmp -EEXIST,%rax kernel function assembly excerpt of return value from `htab_map_update_elem` returning 32-bit int: movl $0xffffffef, %r9d ... movl %r9d, %eax ...results in the comparison: cmp $0xffffffffffffffef, $0x00000000ffffffef Fixes: bdb7b79b4ce8 ("bpf: Switch most helper return values from 32-bit int to 64-bit long") Tested-by: Eduard Zingerman <[email protected]> Signed-off-by: JP Kobryn <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-03-22bpf: Teach the verifier to recognize rdonly_mem as not null.Alexei Starovoitov1-5/+9
Teach the verifier to recognize PTR_TO_MEM | MEM_RDONLY as not NULL otherwise if (!bpf_ksym_exists(known_kfunc)) doesn't go through dead code elimination. Signed-off-by: Alexei Starovoitov <[email protected]> Signed-off-by: Andrii Nakryiko <[email protected]> Acked-by: David Vernet <[email protected]> Link: https://lore.kernel.org/bpf/[email protected]
2023-03-22livepatch,sched: Add livepatch task switching to cond_resched()Josh Poimboeuf3-23/+149
There have been reports [1][2] of live patches failing to complete within a reasonable amount of time due to CPU-bound kthreads. Fix it by patching tasks in cond_resched(). There are four different flavors of cond_resched(), depending on the kernel configuration. Hook into all of them. A more elegant solution might be to use a preempt notifier. However, non-ORC unwinders can't unwind a preempted task reliably. [1] https://lore.kernel.org/lkml/[email protected]/ [2] https://lkml.kernel.org/lkml/[email protected] Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Tested-by: Seth Forshee (DigitalOcean) <[email protected]> Link: https://lore.kernel.org/r/4ae981466b7814ec221014fc2554b2f86f3fb70b.1677257135.git.jpoimboe@kernel.org
2023-03-22livepatch: Skip task_call_func() for current taskJosh Poimboeuf1-1/+5
The current task doesn't need the scheduler's protection to unwind its own stack. Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Petr Mladek <[email protected]> Tested-by: Seth Forshee (DigitalOcean) <[email protected]> Link: https://lore.kernel.org/r/4b92e793462d532a05f03767151fa29db3e68e13.1677257135.git.jpoimboe@kernel.org
2023-03-22livepatch: Convert stack entries array to percpuJosh Poimboeuf1-2/+7
The entries array in klp_check_stack() is static local because it's too big to be reasonably allocated on the stack. Serialized access is enforced by the klp_mutex. In preparation for calling klp_check_stack() without the mutex (from cond_resched), convert it to a percpu variable. Signed-off-by: Josh Poimboeuf <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/20230313233346.kayh4t2lpicjkpsv@treble
2023-03-22sched: Interleave cfs bandwidth timers for improved single thread ↵Shrikanth Hegde1-0/+4
performance at low utilization CPU cfs bandwidth controller uses hrtimer. Currently there is no initial value set. Hence all period timers would align at expiry. This happens when there are multiple CPU cgroup's. There is a performance gain that can be achieved here if the timers are interleaved when the utilization of each CPU cgroup is low and total utilization of all the CPU cgroup's is less than 50%. If the timers are interleaved, then the unthrottled cgroup can run freely without many context switches and can also benefit from SMT Folding. This effect will be further amplified in SPLPAR environment. This commit adds a random offset after initializing each hrtimer. This would result in interleaving the timers at expiry, which helps in achieving the said performance gain. This was tested on powerpc platform with 8 core SMT=8. Socket power was measured when the workload. Benchmarked the stress-ng with power information. Throughput oriented benchmarks show significant gain up to 25% while power consumption increases up to 15%. Workload: stress-ng --cpu=32 --cpu-ops=50000. 1CG - 1 cgroup is running. 2CG - 2 cgroups are running together. Time taken to complete stress-ng in seconds and power is in watts. each cgroup is throttled at 25% with 100ms as the period value. 6.2-rc6 | with patch 8 core 1CG power 2CG power | 1CG power 2 CG power 27.5 80.6 40 90 | 27.3 82 32.3 104 27.5 81 40.2 91 | 27.5 81 38.7 96 27.7 80 40.1 89 | 27.6 80 29.7 106 27.7 80.1 40.3 94 | 27.6 80 31.5 105 Latency might be affected by this change. That could happen if the CPU was in a deep idle state which is possible if we interleave the timers. Used schbench for measuring the latency. Each cgroup is throttled at 25% with period value is set to 100ms. Numbers are when both the cgroups are running simultaneously. Latency values don't degrade much. Some improvement is seen in tail latencies. 6.2-rc6 with patch Groups: 16 50.0th: 39.5 42.5 75.0th: 924.0 922.0 90.0th: 972.0 968.0 95.0th: 1005.5 994.0 99.0th: 4166.0 2287.0 99.5th: 7314.0 7448.0 99.9th: 15024.0 13600.0 Groups: 32 50.0th: 819.0 463.0 75.0th: 1596.0 918.0 90.0th: 5992.0 1281.5 95.0th: 13184.0 2765.0 99.0th: 21792.0 14240.0 99.5th: 25696.0 18920.0 99.9th: 33280.0 35776.0 Groups: 64 50.0th: 4806.0 3440.0 75.0th: 31136.0 33664.0 90.0th: 54144.0 58752.0 95.0th: 66176.0 67200.0 99.0th: 84736.0 91520.0 99.5th: 97408.0 114048.0 99.9th: 136448.0 140032.0 Initial RFC PATCH, discussions and details on the problem: Link1: https://lore.kernel.org/lkml/[email protected]/ Link2: https://lore.kernel.org/lkml/[email protected]/ Suggested-by: Peter Zijlstra <[email protected]> Suggested-by: Thomas Gleixner <[email protected]> Signed-off-by: Shrikanth Hegde<[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Ben Segall <[email protected]> Reviewed-by: Vincent Guittot <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-03-22sched/core: Reduce cost of sched_move_task when config autogroupwuchi1-3/+19
Some sched_move_task calls are useless because that task_struct->sched_task_group maybe not changed (equals task_group of cpu_cgroup) when system enable autogroup. So do some checks in sched_move_task. sched_move_task eg: task A belongs to cpu_cgroup0 and autogroup0, it will always belong to cpu_cgroup0 when do_exit. So there is no need to do {de|en}queue. The call graph is as follow. do_exit sched_autogroup_exit_task sched_move_task dequeue_task sched_change_group A.sched_task_group = sched_get_task_group (=cpu_cgroup0) enqueue_task Performance results: =========================== 1. env cpu: bogomips=4600.00 kernel: 6.3.0-rc3 cpu_cgroup: 6:cpu,cpuacct:/user.slice 2. cmds do_exit script: for i in {0..10000}; do sleep 0 & done wait Run the above script, then use the following bpftrace cmd to get the cost of sched_move_task: bpftrace -e 'k:sched_move_task { @ts[tid] = nsecs; } kr:sched_move_task /@ts[tid]/ { @ns += nsecs - @ts[tid]; delete(@ts[tid]); }' 3. cost time(ns): without patch: 43528033 with patch: 18541416 diff:-24986617 -57.4% As the result show, the patch will save 57.4% in the scenario. Signed-off-by: wuchi <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2023-03-22sched/core: Avoid selecting the task that is throttled to run when ↵Hao Jia5-19/+90
core-sched enable When {rt, cfs}_rq or dl task is throttled, since cookied tasks are not dequeued from the core tree, So sched_core_find() and sched_core_next() may return throttled task, which may cause throttled task to run on the CPU. So we add checks in sched_core_find() and sched_core_next() to make sure that the return is a runnable task that is not throttled. Co-developed-by: Cruz Zhao <[email protected]> Signed-off-by: Cruz Zhao <[email protected]> Signed-off-by: Hao Jia <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lkml.kernel.org/r/[email protected]
2023-03-22sched/topology: Make sched_energy_mutex,update staticTom Rix1-2/+2
smatch reports kernel/sched/topology.c:212:1: warning: symbol 'sched_energy_mutex' was not declared. Should it be static? kernel/sched/topology.c:213:6: warning: symbol 'sched_energy_update' was not declared. Should it be static? These variables are only used in topology.c, so should be static Signed-off-by: Tom Rix <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Valentin Schneider <[email protected]> Link: https://lore.kernel.org/r/[email protected]
2023-03-22swiotlb: fix slot alignment checksPetr Tesarik1-6/+10
Explicit alignment and page alignment are used only to calculate the stride, not when checking actual slot physical address. Originally, only page alignment was implemented, and that worked, because the whole SWIOTLB is allocated on a page boundary, so aligning the start index was sufficient to ensure a page-aligned slot. When commit 1f221a0d0dbf ("swiotlb: respect min_align_mask") added support for min_align_mask, the index could be incremented in the search loop, potentially finding an unaligned slot if minimum device alignment is between IO_TLB_SIZE and PAGE_SIZE. The bug could go unnoticed, because the slot size is 2 KiB, and the most common page size is 4 KiB, so there is no alignment value in between. IIUC the intention has been to find a slot that conforms to all alignment constraints: device minimum alignment, an explicit alignment (given as function parameter) and optionally page alignment (if allocation size is >= PAGE_SIZE). The most restrictive mask can be trivially computed with logical AND. The rest can stay. Fixes: 1f221a0d0dbf ("swiotlb: respect min_align_mask") Fixes: e81e99bacc9f ("swiotlb: Support aligned swiotlb buffers") Signed-off-by: Petr Tesarik <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2023-03-22swiotlb: use wrap_area_index() instead of open-coding itPetr Tesarik1-4/+1
No functional change, just use an existing helper. Signed-off-by: Petr Tesarik <[email protected]> Signed-off-by: Christoph Hellwig <[email protected]>
2023-03-21bpf: Adjust insufficient default bpf_jit_limitDaniel Borkmann1-1/+1
We've seen recent AWS EKS (Kubernetes) user reports like the following: After upgrading EKS nodes from v20230203 to v20230217 on our 1.24 EKS clusters after a few days a number of the nodes have containers stuck in ContainerCreating state or liveness/readiness probes reporting the following error: Readiness probe errored: rpc error: code = Unknown desc = failed to exec in container: failed to start exec "4a11039f730203ffc003b7[...]": OCI runtime exec failed: exec failed: unable to start container process: unable to init seccomp: error loading seccomp filter into kernel: error loading seccomp filter: errno 524: unknown However, we had not been seeing this issue on previous AMIs and it only started to occur on v20230217 (following the upgrade from kernel 5.4 to 5.10) with no other changes to the underlying cluster or workloads. We tried the suggestions from that issue (sysctl net.core.bpf_jit_limit=452534528) which helped to immediately allow containers to be created and probes to execute but after approximately a day the issue returned and the value returned by cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}' was steadily increasing. I tested bpf tree to observe bpf_jit_charge_modmem, bpf_jit_uncharge_modmem their sizes passed in as well as bpf_jit_current under tcpdump BPF filter, seccomp BPF and native (e)BPF programs, and the behavior all looks sane and expected, that is nothing "leaking" from an upstream perspective. The bpf_jit_limit knob was originally added in order to avoid a situation where unprivileged applications loading BPF programs (e.g. seccomp BPF policies) consuming all the module memory space via BPF JIT such that loading of kernel modules would be prevented. The default limit was defined back in 2018 and while good enough back then, we are generally seeing far more BPF consumers today. Adjust the limit for the BPF JIT pool from originally 1/4 to now 1/2 of the module memory space to better reflect today's needs and avoid more users running into potentially hard to debug issues. Fixes: fdadd04931c2 ("bpf: fix bpf_jit_limit knob for PAGE_SIZE >= 64K") Reported-by: Stephen Haynes <[email protected]> Reported-by: Lefteris Alexakis <[email protected]> Signed-off-by: Daniel Borkmann <[email protected]> Link: https://github.com/awslabs/amazon-eks-ami/issues/1179 Link: https://github.com/awslabs/amazon-eks-ami/issues/1219 Reviewed-by: Kuniyuki Iwashima <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>
2023-03-21ftrace: Show a list of all functions that have ever been enabledSteven Rostedt (Google)1-5/+46
When debugging a crash that appears to be related to ftrace, but not for sure, it is useful to know if a function was ever enabled by ftrace or not. It could be that a BPF program was attached to it, or possibly a live patch. We are having crashes in the field where this information is not always known. But having ftrace set a flag if a function has ever been attached since boot up helps tremendously in trying to know if a crash had to do with something using ftrace. For analyzing crashes, the use of a kdump image can have access to the flags. When looking at issues where the kernel did not panic, the touched_functions file can simply be used. Link: https://lore.kernel.org/linux-trace-kernel/[email protected] Cc: Masami Hiramatsu <[email protected]> Cc: Catalin Marinas <[email protected]> Tested-by: Mark Rutland <[email protected]> Tested-by: Chris Li <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-03-21ring_buffer: Use try_cmpxchg instead of cmpxchgUros Bizjak1-4/+4
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg fails. There is no need to re-read the value in the loop. No functional change intended. Link: https://lkml.kernel.org/r/[email protected] Cc: Masami Hiramatsu <[email protected]> Signed-off-by: Uros Bizjak <[email protected]> Acked-by: Mukesh Ojha <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-03-21ring_buffer: Change some static functions to boolUros Bizjak1-24/+23
The return values of some functions are of boolean type. Change the type of these function to bool and adjust their return values. Also change type of some internal varibles to bool. No functional change intended. Link: https://lkml.kernel.org/r/[email protected] Cc: Masami Hiramatsu <[email protected]> Signed-off-by: Uros Bizjak <[email protected]> Reviewed-by: Mukesh Ojha <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-03-21ring_buffer: Change some static functions to voidUros Bizjak1-15/+7
The results of some static functions are not used. Change the type of these function to void and remove unnecessary returns. No functional change intended. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Uros Bizjak <[email protected]> Reviewed-by: Masami Hiramatsu <[email protected]> Reviewed-by: Mukesh Ojha <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-03-21ftrace: selftest: remove broken trace_direct_trampMark Rutland1-10/+2
The ftrace selftest code has a trace_direct_tramp() function which it uses as a direct call trampoline. This happens to work on x86, since the direct call's return address is in the usual place, and can be returned to via a RET, but in general the calling convention for direct calls is different from regular function calls, and requires a trampoline written in assembly. On s390, regular function calls place the return address in %r14, and an ftrace patch-site in an instrumented function places the trampoline's return address (which is within the instrumented function) in %r0, preserving the original %r14 value in-place. As a regular C function will return to the address in %r14, using a C function as the trampoline results in the trampoline returning to the caller of the instrumented function, skipping the body of the instrumented function. Note that the s390 issue is not detcted by the ftrace selftest code, as the instrumented function is trivial, and returning back into the caller happens to be equivalent. On arm64, regular function calls place the return address in x30, and an ftrace patch-site in an instrumented function saves this into r9 and places the trampoline's return address (within the instrumented function) in x30. A regular C function will return to the address in x30, but will not restore x9 into x30. Consequently, using a C function as the trampoline results in returning to the trampoline's return address having corrupted x30, such that when the instrumented function returns, it will return back into itself. To avoid future issues in this area, remove the trace_direct_tramp() function, and require that each architecture with direct calls provides a stub trampoline, named ftrace_stub_direct_tramp. This can be written to handle the architecture's trampoline calling convention, and in future could be used elsewhere (e.g. in the ftrace ops sample, to measure the overhead of direct calls), so we may as well always build it in. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Mark Rutland <[email protected]> Cc: Li Huafei <[email protected]> Cc: Xu Kuohai <[email protected]> Signed-off-by: Florent Revest <[email protected]> Acked-by: Jiri Olsa <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-03-21ftrace: Make DIRECT_CALLS work WITH_ARGS and !WITH_REGSFlorent Revest2-2/+2
Direct called trampolines can be called in two ways: - either from the ftrace callsite. In this case, they do not access any struct ftrace_regs nor pt_regs - Or, if a ftrace ops is also attached, from the end of a ftrace trampoline. In this case, the call_direct_funcs ops is in charge of setting the direct call trampoline's address in a struct ftrace_regs Since: commit 9705bc709604 ("ftrace: pass fregs to arch_ftrace_set_direct_caller()") The later case no longer requires a full pt_regs. It only needs a struct ftrace_regs so DIRECT_CALLS can work with both WITH_ARGS or WITH_REGS. With architectures like arm64 already abandoning WITH_REGS in favor of WITH_ARGS, it's important to have DIRECT_CALLS work WITH_ARGS only. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Florent Revest <[email protected]> Co-developed-by: Mark Rutland <[email protected]> Signed-off-by: Mark Rutland <[email protected]> Acked-by: Jiri Olsa <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-03-21ftrace: Store direct called addresses in their opsFlorent Revest1-2/+5
All direct calls are now registered using the register_ftrace_direct API so each ops can jump to only one direct-called trampoline. By storing the direct called trampoline address directly in the ops we can save one hashmap lookup in the direct call ops and implement arm64 direct calls on top of call ops. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Florent Revest <[email protected]> Acked-by: Jiri Olsa <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>
2023-03-21ftrace: Rename _ftrace_direct_multi APIs to _ftrace_direct APIsFlorent Revest3-27/+28
Now that the original _ftrace_direct APIs are gone, the "_multi" suffixes only add confusion. Link: https://lkml.kernel.org/r/[email protected] Signed-off-by: Florent Revest <[email protected]> Acked-by: Mark Rutland <[email protected]> Tested-by: Mark Rutland <[email protected]> Acked-by: Jiri Olsa <[email protected]> Signed-off-by: Steven Rostedt (Google) <[email protected]>