path: root/kernel
2021-12-18  bpf: Replace ARG_XXX_OR_NULL with ARG_XXX | PTR_MAYBE_NULL  (Hao Luo; 1 file, -25/+14)

We have introduced a new type to make bpf_arg composable, by reserving high bits of bpf_arg to represent flags of a type. One of the flags is PTR_MAYBE_NULL, which indicates a pointer may be NULL. When applying this flag to an arg_type, it means the arg can take a NULL pointer. This patch switches the qualified arg_types to use this flag. The arg_types changed in this patch include:

1. ARG_PTR_TO_MAP_VALUE_OR_NULL
2. ARG_PTR_TO_MEM_OR_NULL
3. ARG_PTR_TO_CTX_OR_NULL
4. ARG_PTR_TO_SOCKET_OR_NULL
5. ARG_PTR_TO_ALLOC_MEM_OR_NULL
6. ARG_PTR_TO_STACK_OR_NULL

This patch does not eliminate the use of these arg_types; instead it makes them aliases to 'ARG_XXX | PTR_MAYBE_NULL'.

Signed-off-by: Hao Luo <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

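A hedged sketch of the flag-composition idea this enables (simplified, illustrative definitions; the real ones live in include/linux/bpf.h):

  /* Reserve a high bit of the arg type as a composable modifier, so
   * nullable variants become plain ORs instead of separate enums.
   * Simplified sketch, not the kernel's exact layout. */
  #define BPF_TYPE_FLAG_OFF  8
  #define PTR_MAYBE_NULL     (1 << BPF_TYPE_FLAG_OFF)

  enum bpf_arg_type_sketch {
          ARG_PTR_TO_CTX = 1,
          ARG_PTR_TO_MEM,
          /* old _OR_NULL names survive as aliases */
          ARG_PTR_TO_CTX_OR_NULL = ARG_PTR_TO_CTX | PTR_MAYBE_NULL,
          ARG_PTR_TO_MEM_OR_NULL = ARG_PTR_TO_MEM | PTR_MAYBE_NULL,
  };

  /* checks then test the flag instead of matching every alias */
  static inline bool arg_maybe_null(int arg_type)
  {
          return arg_type & PTR_MAYBE_NULL;
  }
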
2021-12-18  Merge branch 'locking/urgent' into locking/core  (Thomas Gleixner; 1 file, -1/+1)

Pick up the spin loop condition fix.

Signed-off-by: Thomas Gleixner <[email protected]>

2021-12-18  locking/rtmutex: Fix incorrect condition in rtmutex_spin_on_owner()  (Zqiang; 1 file, -1/+1)

Optimistic spinning needs to be terminated when the spinning waiter is no longer the top waiter on the lock, but the condition is negated. It terminates if the waiter is the top waiter, which defeats the whole purpose.

Fixes: c3123c431447 ("locking/rtmutex: Dont dereference waiter lockless")
Signed-off-by: Zqiang <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]

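The fix is essentially a one-character negation; a hedged sketch of the intended loop exit (illustrative, not the verbatim rtmutex_spin_on_owner() code):

  /* Spin only while the waiter is still the top waiter; the bug had
   * the check un-negated, terminating exactly when spinning was
   * still useful. */
  for (;;) {
          if (!rt_mutex_waiter_is_top_waiter(lock, waiter))
                  break;  /* no longer top waiter: stop spinning */
          cpu_relax();
  }
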
2021-12-17  timekeeping: Really make sure wall_to_monotonic isn't positive  (Yu Liao; 1 file, -2/+1)

Even after commit e1d7ba873555 ("time: Always make sure wall_to_monotonic isn't positive") it is still possible to make wall_to_monotonic positive by running the following code:

  int main(void)
  {
          struct timespec time;

          clock_gettime(CLOCK_MONOTONIC, &time);
          time.tv_nsec = 0;
          clock_settime(CLOCK_REALTIME, &time);
          return 0;
  }

The reason is that the second parameter of timespec64_compare(), ts_delta, may be unnormalized because the delta is calculated with an open-coded subtraction, which causes the comparison of tv_sec to yield the wrong result:

  wall_to_monotonic = { .tv_sec = -10, .tv_nsec = 900000000 }
  ts_delta          = { .tv_sec =  -9, .tv_nsec = -900000000 }

That makes timespec64_compare() claim that wall_to_monotonic < ts_delta, but actually the result should be wall_to_monotonic > ts_delta. After normalization, the result of timespec64_compare() is correct because the tv_sec comparison is no longer misleading:

  wall_to_monotonic = { .tv_sec = -10, .tv_nsec = 900000000 }
  ts_delta          = { .tv_sec = -10, .tv_nsec = 100000000 }

Use timespec64_sub() to ensure that ts_delta is normalized, which fixes the issue.

Fixes: e1d7ba873555 ("time: Always make sure wall_to_monotonic isn't positive")
Signed-off-by: Yu Liao <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/r/[email protected]

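To see why normalization matters, a small hedged example (timespec64_sub() from include/linux/time64.h keeps tv_nsec within [0, NSEC_PER_SEC)):

  struct timespec64 a = { .tv_sec = 0, .tv_nsec = 100000000 }; /* 0.1s */
  struct timespec64 b = { .tv_sec = 0, .tv_nsec = 900000000 }; /* 0.9s */

  /* open coded: { .tv_sec = 0, .tv_nsec = -800000000 }; tv_nsec is out
   * of range, so a tv_sec-first comparison treats the value as >= 0 */
  struct timespec64 bad = { a.tv_sec - b.tv_sec, a.tv_nsec - b.tv_nsec };

  /* timespec64_sub(a, b): { .tv_sec = -1, .tv_nsec = 200000000 }; the
   * same -0.8s, but normalized, so timespec64_compare() is correct */
  struct timespec64 good = timespec64_sub(a, b);
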
2021-12-16  Only output backtracking information in log level 2  (Christy Lee; 1 file, -3/+3)

Backtracking information is very verbose; to improve readability, don't print it at log level 1.

Signed-off-by: Christy Lee <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

2021-12-16  bpf: Right align verifier states in verifier logs.  (Christy Lee; 1 file, -21/+36)

Make the verifier logs more readable: print the verifier states on the corresponding instruction line. If the previous line was not a bpf instruction, then print the verifier states on their own line.

Before:

  Validating test_pkt_access_subprog3() func#3...
  86: R1=invP(id=0) R2=ctx(id=0,off=0,imm=0) R10=fp0
  ; int test_pkt_access_subprog3(int val, struct __sk_buff *skb)
  86: (bf) r6 = r2
  87: R2=ctx(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0)
  87: (bc) w7 = w1
  88: R1=invP(id=0) R7_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
  ; return get_skb_len(skb) * get_skb_ifindex(val, skb, get_constant(123));
  88: (bf) r1 = r6
  89: R1_w=ctx(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0)
  89: (85) call pc+9
  Func#4 is global and valid. Skipping.
  90: R0_w=invP(id=0)
  90: (bc) w8 = w0
  91: R0_w=invP(id=0) R8_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
  ; return get_skb_len(skb) * get_skb_ifindex(val, skb, get_constant(123));
  91: (b7) r1 = 123
  92: R1_w=invP123
  92: (85) call pc+65
  Func#5 is global and valid. Skipping.
  93: R0=invP(id=0)

After:

  86: R1=invP(id=0) R2=ctx(id=0,off=0,imm=0) R10=fp0
  ; int test_pkt_access_subprog3(int val, struct __sk_buff *skb)
  86: (bf) r6 = r2 ; R2=ctx(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0)
  87: (bc) w7 = w1 ; R1=invP(id=0) R7_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
  ; return get_skb_len(skb) * get_skb_ifindex(val, skb, get_constant(123));
  88: (bf) r1 = r6 ; R1_w=ctx(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0)
  89: (85) call pc+9
  Func#4 is global and valid. Skipping.
  90: R0_w=invP(id=0)
  90: (bc) w8 = w0 ; R0_w=invP(id=0) R8_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
  ; return get_skb_len(skb) * get_skb_ifindex(val, skb, get_constant(123));
  91: (b7) r1 = 123 ; R1_w=invP123
  92: (85) call pc+65
  Func#5 is global and valid. Skipping.
  93: R0=invP(id=0)

Signed-off-by: Christy Lee <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

2021-12-16  bpf: Only print scratched registers and stack slots to verifier logs.  (Christy Lee; 1 file, -14/+69)

When printing verifier state for any log level, print full verifier state only on function calls or on errors. Otherwise, only print the registers and stack slots that were accessed.

Log size differences:

  verif_scale_loop6 before: 234566564
  verif_scale_loop6 after:   72143943  (69% size reduction)
  kfree_skb before: 166406
  kfree_skb after:   55386  (69% size reduction)

Before:

  156: (61) r0 = *(u32 *)(r1 +0)
  157: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R1=ctx(id=0,off=0,imm=0) R2_w=invP0 R10=fp0 fp-8_w=00000000 fp-16_w=00000000 fp-24_w=00000000 fp-32_w=00000000 fp-40_w=00000000 fp-48_w=00000000 fp-56_w=00000000 fp-64_w=00000000 fp-72_w=00000000 fp-80_w=00000000 fp-88_w=00000000 fp-96_w=00000000 fp-104_w=00000000 fp-112_w=00000000 fp-120_w=00000000 fp-128_w=00000000 fp-136_w=00000000 fp-144_w=00000000 fp-152_w=00000000 fp-160_w=00000000 fp-168_w=00000000 fp-176_w=00000000 fp-184_w=00000000 fp-192_w=00000000 fp-200_w=00000000 fp-208_w=00000000 fp-216_w=00000000 fp-224_w=00000000 fp-232_w=00000000 fp-240_w=00000000 fp-248_w=00000000 fp-256_w=00000000 fp-264_w=00000000 fp-272_w=00000000 fp-280_w=00000000 fp-288_w=00000000 fp-296_w=00000000 fp-304_w=00000000 fp-312_w=00000000 fp-320_w=00000000 fp-328_w=00000000 fp-336_w=00000000 fp-344_w=00000000 fp-352_w=00000000 fp-360_w=00000000 fp-368_w=00000000 fp-376_w=00000000 fp-384_w=00000000 fp-392_w=00000000 fp-400_w=00000000 fp-408_w=00000000 fp-416_w=00000000 fp-424_w=00000000 fp-432_w=00000000 fp-440_w=00000000 fp-448_w=00000000
  ; return skb->len;
  157: (95) exit
  Func#4 is safe for any args that match its prototype
  Validating get_constant() func#5...
  158: R1=invP(id=0) R10=fp0
  ; int get_constant(long val)
  158: (bf) r0 = r1
  159: R0_w=invP(id=1) R1=invP(id=1) R10=fp0
  ; return val - 122;
  159: (04) w0 += -122
  160: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R1=invP(id=1) R10=fp0
  ; return val - 122;
  160: (95) exit
  Func#5 is safe for any args that match its prototype
  Validating get_skb_ifindex() func#6...
  161: R1=invP(id=0) R2=ctx(id=0,off=0,imm=0) R3=invP(id=0) R10=fp0
  ; int get_skb_ifindex(int val, struct __sk_buff *skb, int var)
  161: (bc) w0 = w3
  162: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R1=invP(id=0) R2=ctx(id=0,off=0,imm=0) R3=invP(id=0) R10=fp0

After:

  156: (61) r0 = *(u32 *)(r1 +0)
  157: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R1=ctx(id=0,off=0,imm=0)
  ; return skb->len;
  157: (95) exit
  Func#4 is safe for any args that match its prototype
  Validating get_constant() func#5...
  158: R1=invP(id=0) R10=fp0
  ; int get_constant(long val)
  158: (bf) r0 = r1
  159: R0_w=invP(id=1) R1=invP(id=1)
  ; return val - 122;
  159: (04) w0 += -122
  160: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
  ; return val - 122;
  160: (95) exit
  Func#5 is safe for any args that match its prototype
  Validating get_skb_ifindex() func#6...
  161: R1=invP(id=0) R2=ctx(id=0,off=0,imm=0) R3=invP(id=0) R10=fp0
  ; int get_skb_ifindex(int val, struct __sk_buff *skb, int var)
  161: (bc) w0 = w3
  162: R0_w=invP(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff)) R3=invP(id=0)

Signed-off-by: Christy Lee <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

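The bookkeeping behind this, as a hedged sketch (the patch adds bitmask tracking to the verifier env; names follow the patch but treat the details as illustrative):

  /* remember which registers/stack slots the current insn touched,
   * print only those, then clear the masks for the next line */
  static void mark_reg_scratched(struct bpf_verifier_env *env, u32 regno)
  {
          env->scratched_regs |= 1U << regno;
  }

  static bool reg_scratched(const struct bpf_verifier_env *env, u32 regno)
  {
          return env->scratched_regs & (1U << regno);
  }

  static void mark_verifier_state_clean(struct bpf_verifier_env *env)
  {
          env->scratched_regs = 0U;
          env->scratched_stack_slots = 0ULL;
  }
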
2021-12-16  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Jakub Kicinski; 6 files, -41/+75)

No conflicts.

Signed-off-by: Jakub Kicinski <[email protected]>

2021-12-16  Merge tag 'audit-pr-20211216' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit  (Linus Torvalds; 1 file, -11/+10)

Pull audit fix from Paul Moore:
 "A single patch to fix a problem where the audit queue could grow
  unbounded when the audit daemon is forcibly stopped"

* tag 'audit-pr-20211216' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: improve robustness of the audit queue handling

2021-12-16  Merge tag 'net-5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  (Linus Torvalds; 1 file, -17/+36)

Pull networking fixes from Jakub Kicinski:
 "Networking fixes, including fixes from mac80211, wifi, bpf.

  Relatively large batches of fixes from BPF and the WiFi stack, calm
  in general networking.

  Current release - regressions:

   - dpaa2-eth: fix buffer overrun when reporting ethtool statistics

  Current release - new code bugs:

   - bpf: fix incorrect state pruning for <8B spill/fill

   - iavf:
      - add missing unlocks in iavf_watchdog_task()
      - do not override the adapter state in the watchdog task (again)

   - mlxsw: spectrum_router: consolidate MAC profiles when possible

  Previous releases - regressions:

   - mac80211 fixes:
      - rate control, avoid driver crash for retransmitted frames
      - regression in SSN handling of addba tx
      - a memory leak where sta_info is not freed
      - marking TX-during-stop for TX in in_reconfig, prevent stall

   - cfg80211: acquire wiphy mutex on regulatory work

   - wifi drivers: fix build regressions and LED config dependency

   - virtio_net: fix rx_drops stat for small pkts

   - dsa: mv88e6xxx: unforce speed & duplex in mac_link_down()

  Previous releases - always broken:

   - bpf fixes:
      - kernel address leakage in atomic fetch
      - kernel address leakage in atomic cmpxchg's r0 aux reg
      - signed bounds propagation after mov32
      - extable fixup offset
      - extable address check

   - mac80211:
      - fix the size used for building probe request
      - send ADDBA requests using the tid/queue of the aggregation
        session
      - agg-tx: don't schedule_and_wake_txq() under sta->lock, avoid
        deadlocks
      - validate extended element ID is present

   - mptcp:
      - never allow the PM to close a listener subflow (null-defer)
      - clear 'kern' flag from fallback sockets, prevent crash
      - fix deadlock in __mptcp_push_pending()

   - inet_diag: fix kernel-infoleak for UDP sockets

   - xsk: do not sleep in poll() when need_wakeup set

   - smc: avoid very long waits in smc_release()

   - sch_ets: don't remove idle classes from the round-robin list

   - netdevsim:
      - zero-initialize memory for bpf map's value, prevent info leak
      - don't let user space overwrite read only (max) ethtool parms

   - ixgbe: set X550 MDIO speed before talking to PHY

   - stmmac:
      - fix null-deref in flower deletion w/ VLAN prio Rx steering
      - dwmac-rk: fix oob read in rk_gmac_setup

   - ice: time stamping fixes

   - systemport: add global locking for descriptor life cycle"

* tag 'net-5.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (89 commits)
  bpf, selftests: Fix racing issue in btf_skc_cls_ingress test
  selftest/bpf: Add a test that reads various addresses.
  bpf: Fix extable address check.
  bpf: Fix extable fixup offset.
  bpf, selftests: Add test case trying to taint map value pointer
  bpf: Make 32->64 bounds propagation slightly more robust
  bpf: Fix signed bounds propagation after mov32
  sit: do not call ipip6_dev_free() from sit_init_net()
  net: systemport: Add global locking for descriptor lifecycle
  net/smc: Prevent smc_release() from long blocking
  net: Fix double 0x prefix print in SKB dump
  virtio_net: fix rx_drops stat for small pkts
  dsa: mv88e6xxx: fix debug print for SPEED_UNFORCED
  sfc_ef100: potential dereference of null pointer
  net: stmmac: dwmac-rk: fix oob read in rk_gmac_setup
  net: usb: lan78xx: add Allied Telesis AT29M2-AF
  net/packet: rx_owner_map depends on pg_vec
  netdevsim: Zero-initialize memory for new map's value in function nsim_bpf_map_alloc
  dpaa2-eth: fix ethtool statistics
  ixgbe: set X550 MDIO speed before talking to PHY
  ...

2021-12-16  add missing bpf-cgroup.h includes  (Jakub Kicinski; 6 files, -0/+6)

We're about to break the cgroup-defs.h -> bpf-cgroup.h dependency; make sure those who actually need more than the definition of struct cgroup_bpf include bpf-cgroup.h explicitly.

Signed-off-by: Jakub Kicinski <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Link: https://lore.kernel.org/bpf/[email protected]

2021-12-16  genirq/msi: Convert storage to xarray  (Thomas Gleixner; 1 file, -92/+77)

The current linked-list storage for MSI descriptors is suboptimal in several ways:

1) Looking up an MSI descriptor requires an O(n) list walk in the worst case.

2) The upcoming support of runtime expansion of MSI-X vectors would need to do a full list walk to figure out whether a particular index is already associated.

3) Runtime expansion of sparse allocations is even more complex, as the current implementation assumes an ordered list (increasing MSI index).

Use an xarray, which solves all of the above problems nicely.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Tested-by: Nishanth Menon <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

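A hedged sketch of what the xarray buys (illustrative wrappers, not the patch itself; xa_load()/xa_insert() are the standard <linux/xarray.h> API, the 'store' field name is an assumption):

  /* lookup by MSI index replaces the O(n) list walk */
  static struct msi_desc *msi_desc_get(struct msi_device_data *md,
                                       unsigned long index)
  {
          return xa_load(&md->store, index);
  }

  /* sparse indices work the same as dense ones: the index is the key,
   * and xa_insert() fails with -EBUSY if the index is already used */
  static int msi_desc_insert(struct msi_device_data *md,
                             unsigned long index, struct msi_desc *desc)
  {
          return xa_insert(&md->store, index, desc, GFP_KERNEL);
  }
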
2021-12-16  genirq/msi: Simplify sysfs handling  (Thomas Gleixner; 1 file, -107/+91)

The sysfs handling for MSI is a convoluted maze, and it is in the way of supporting dynamic expansion of the MSI-X vectors because it only supports a one-off bulk population/free of the sysfs entries. Change it to:

1) Create an empty sysfs attribute group when msi_device_data is allocated

2) Populate the entries when the MSI descriptor is initialized

3) Free the entries when an MSI descriptor is detached from a Linux interrupt

4) Provide functions for the legacy non-irqdomain fallback code to do a bulk population/free. This code won't support dynamic expansion.

This makes the code simpler and reduces the number of allocations, as the empty attribute group can be shared.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Tested-by: Nishanth Menon <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-12-16  genirq/msi: Mop up old interfaces  (Thomas Gleixner; 1 file, -16/+15)

Get rid of the old iterators, alloc/free functions and adjust the core code accordingly.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Tested-by: Nishanth Menon <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-12-16  genirq/msi: Convert to new functions  (Thomas Gleixner; 1 file, -9/+14)

Use the new iterator functions and add locking where required.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Tested-by: Nishanth Menon <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-12-16  genirq/msi: Make interrupt allocation less convoluted  (Thomas Gleixner; 1 file, -60/+69)

There is no real reason to do several loops over the MSI descriptors instead of just doing one loop. In case of an error everything is undone anyway, so it does not matter whether it's a partial or a full rollback.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Tested-by: Nishanth Menon <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-12-16  platform-msi: Simplify platform device MSI code  (Thomas Gleixner; 1 file, -24/+21)

The allocation code is overly complex. It tries to keep the MSI index space packed, which does not work when an interrupt is freed. There is no requirement for this; the only requirement is that the MSI index is unique.

Move the MSI descriptor allocation into msi_domain_populate_irqs() and use the Linux interrupt number as MSI index, which fulfils the uniqueness requirement. This requires locking the MSI descriptors, which makes the lock order reverse to the regular MSI alloc/free functions vs. the domain mutex. Assign a separate lockdep class for these MSI device domains.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Nishanth Menon <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-12-16  genirq/msi: Provide domain flags to allocate/free MSI descriptors automatically  (Thomas Gleixner; 1 file, -0/+48)

Provide domain info flags which tell the core to allocate simple descriptors, or to free descriptors when the interrupts are freed, and implement the required functionality.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Tested-by: Nishanth Menon <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-12-16  genirq/msi: Provide msi_alloc_msi_desc() and a simple allocator  (Thomas Gleixner; 1 file, -0/+59)

Provide msi_alloc_msi_desc(), which takes a template MSI descriptor for initializing a newly allocated descriptor. This allows simplifying various usage sites of alloc_msi_entry() and moves the storage handling into the core code.

For simple cases where only a linear vector space is required, provide msi_add_simple_msi_descs(), which just allocates a linear range of MSI descriptors and fills msi_desc::msi_index accordingly.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Tested-by: Nishanth Menon <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-12-16  genirq/msi: Provide a set of advanced MSI accessors and iterators  (Thomas Gleixner; 1 file, -0/+96)

In preparation for dynamic handling of MSI-X interrupts, provide a new set of MSI descriptor accessor functions and iterators. They are beneficial per se, as they allow cleaning up quite some code in various MSI domain implementations.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Tested-by: Nishanth Menon <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

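A hedged usage sketch of the iterator style this provides (names follow the series; treat the details as illustrative, not the exact API):

  struct msi_desc *desc;

  /* walk the device's associated descriptors under the MSI lock
   * instead of open-coding list traversal in each MSI domain */
  msi_lock_descs(dev);
  msi_for_each_desc(desc, dev, MSI_DESC_ASSOCIATED) {
          /* e.g. tear down the Linux interrupt behind each descriptor */
          irq_domain_free_irqs(desc->irq, 1);
  }
  msi_unlock_descs(dev);
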
2021-12-16  genirq/msi: Provide msi_domain_alloc/free_irqs_descs_locked()  (Thomas Gleixner; 1 file, -16/+58)

Usage sites which allocate the MSI descriptors before invoking msi_domain_alloc_irqs() require the MSI descriptors to be locked across the operation. Provide entry points which can be called with the MSI mutex held, and lock the mutex in the existing entry points.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Tested-by: Nishanth Menon <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-12-16  genirq/msi: Add mutex for MSI list protection  (Thomas Gleixner; 1 file, -0/+21)

For upcoming runtime extensions of MSI-X interrupts it's required to protect the MSI descriptor list. Add a mutex to struct msi_device_data and provide lock/unlock functions.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Tested-by: Nishanth Menon <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-12-16  genirq/msi: Move descriptor list to struct msi_device_data  (Thomas Gleixner; 1 file, -1/+4)

It's only required when MSI is in use.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Tested-by: Nishanth Menon <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-12-16  genirq/msi: Provide interface to retrieve Linux interrupt number  (Thomas Gleixner; 1 file, -0/+36)

This allows drivers to retrieve the Linux interrupt number instead of fiddling with MSI descriptors. msi_get_virq() returns the Linux interrupt number, or 0 if there is no entry for the given MSI index.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Tested-by: Nishanth Menon <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

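Typical driver-side usage, as a hedged sketch (my_handler and data are placeholders, not from the patch):

  /* map MSI index -> Linux irq without touching MSI descriptors */
  unsigned int virq = msi_get_virq(&pdev->dev, 0);

  if (!virq)
          return -ENXIO;  /* nothing allocated at this MSI index */

  ret = request_irq(virq, my_handler, 0, "my-driver", data);
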
2021-12-16  genirq/msi: Remove the original sysfs interfaces  (Thomas Gleixner; 1 file, -33/+20)

No more users. Refactor the core code accordingly and move the global interface under CONFIG_PCI_MSI_ARCH_FALLBACKS.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Tested-by: Nishanth Menon <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-12-16  genirq/msi: Provide msi_device_populate/destroy_sysfs()  (Thomas Gleixner; 1 file, -2/+40)

Add new allocation functions which can be activated by domain info flags. They store the groups pointer in struct msi_device_data.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Tested-by: Nishanth Menon <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-12-16  device: Add device::msi_data pointer and struct msi_device_data  (Thomas Gleixner; 1 file, -0/+32)

Create struct msi_device_data and add a pointer of that type to struct dev_msi_info, which is part of struct device. Provide an allocator function which can be invoked from the MSI interrupt allocation code paths.

Add a properties field to the data structure as a first member so the allocation size is not zero bytes. The field will be used later on.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Tested-by: Nishanth Menon <[email protected]>
Reviewed-by: Greg Kroah-Hartman <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

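A hedged sketch of the resulting shape (simplified; the real layout lives in include/linux/msi.h and include/linux/device.h):

  /* per-device MSI state, allocated on first use */
  struct msi_device_data {
          unsigned long properties;  /* first member: struct never empty */
          /* descriptor storage, locking, sysfs state are added later */
  };

  struct dev_msi_info {
          /* ... irq domain pointer etc. ... */
          struct msi_device_data *data;  /* NULL until MSI setup */
  };
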
2021-12-16  genirq/msi: Use PCI device property  (Thomas Gleixner; 1 file, -15/+2)

Use the PCI device property to determine whether this is MSI or MSI-X instead of consulting MSI descriptors.

Signed-off-by: Thomas Gleixner <[email protected]>
Tested-by: Michael Kelley <[email protected]>
Tested-by: Nishanth Menon <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/[email protected]

2021-12-16  bpf: Make 32->64 bounds propagation slightly more robust  (Daniel Borkmann; 1 file, -9/+15)

Make the bounds propagation in __reg_assign_32_into_64() slightly more robust and readable by aligning it similarly as we did back in the __reg_combine_64_into_32() counterpart. Meaning, only propagate or pessimize them as a smin/smax pair.

Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: John Fastabend <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>

2021-12-16  bpf: Fix signed bounds propagation after mov32  (Daniel Borkmann; 1 file, -0/+4)

For the case where both s32_{min,max}_value bounds are positive, __reg_assign_32_into_64() directly propagates them to their 64-bit counterparts; otherwise it pessimises them into the [0, u32_max] universe and tries to refine them later on by learning through the tnum, as per the comment in the mentioned function. However, that does not always happen. For example, the mov32 operation calls zext_32_to_64(dst_reg), which invokes __reg_assign_32_into_64() as-is without a subsequent bounds update as elsewhere, thus no refinement based on the tnum takes place.

Thus, not calling into the __update_reg_bounds() / __reg_deduce_bounds() / __reg_bound_offset() triplet as we do, for example, in case of ALU ops via adjust_scalar_min_max_vals(), will lead to more pessimistic bounds when dumping the full register state:

Before fix:

  0: (b4) w0 = -1
  1: R0_w=invP4294967295 (id=0,imm=ffffffff,
     smin_value=4294967295,smax_value=4294967295,
     umin_value=4294967295,umax_value=4294967295,
     var_off=(0xffffffff; 0x0),
     s32_min_value=-1,s32_max_value=-1,
     u32_min_value=-1,u32_max_value=-1)
  1: (bc) w0 = w0
  2: R0_w=invP4294967295 (id=0,imm=ffffffff,
     smin_value=0,smax_value=4294967295,
     umin_value=4294967295,umax_value=4294967295,
     var_off=(0xffffffff; 0x0),
     s32_min_value=-1,s32_max_value=-1,
     u32_min_value=-1,u32_max_value=-1)

Technically, the smin_value=0 and smax_value=4294967295 bounds are not incorrect, but given the register is still a constant, they break the assumptions about const scalars that smin_value == smax_value and umin_value == umax_value.

After fix:

  0: (b4) w0 = -1
  1: R0_w=invP4294967295 (id=0,imm=ffffffff,
     smin_value=4294967295,smax_value=4294967295,
     umin_value=4294967295,umax_value=4294967295,
     var_off=(0xffffffff; 0x0),
     s32_min_value=-1,s32_max_value=-1,
     u32_min_value=-1,u32_max_value=-1)
  1: (bc) w0 = w0
  2: R0_w=invP4294967295 (id=0,imm=ffffffff,
     smin_value=4294967295,smax_value=4294967295,
     umin_value=4294967295,umax_value=4294967295,
     var_off=(0xffffffff; 0x0),
     s32_min_value=-1,s32_max_value=-1,
     u32_min_value=-1,u32_max_value=-1)

Without the smin_value == smax_value and umin_value == umax_value invariant being intact for const scalars, it is possible to leak out kernel pointers from unprivileged user space if the latter is enabled. For example, when such registers are involved in pointer arithmetic, then adjust_ptr_min_max_vals() will taint the destination register into an unknown scalar, and the latter can be exported and stored, e.g., into a BPF map value.

Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking")
Reported-by: Kuee K1r0a <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Reviewed-by: John Fastabend <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>

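A hedged sketch of the shape of the fix in the mov32 path (these bounds helpers exist in kernel/bpf/verifier.c; the exact call site is simplified here):

  /* zext_32_to_64() pessimizes the 64-bit bounds to [0, u32_max];
   * re-derive them from the tnum the same way the ALU paths do */
  zext_32_to_64(dst_reg);

  __update_reg_bounds(dst_reg);
  __reg_deduce_bounds(dst_reg);
  __reg_bound_offset(dst_reg);
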
2021-12-15  audit: improve robustness of the audit queue handling  (Paul Moore; 1 file, -11/+10)

If the audit daemon were ever to get stuck in a stopped state, the kernel's kauditd_thread() could get blocked attempting to send audit records to the userspace audit daemon. With the kernel thread blocked, it is possible that the audit queue could grow unbounded, as certain audit-record-generating events must be exempt from the queue limits else the system enters a deadlock state.

This patch resolves the problem by lowering the kernel thread's socket-sending timeout from MAX_SCHEDULE_TIMEOUT to HZ/10 and tweaks the kauditd_send_queue() function to better manage the various audit queues when connection problems occur between the kernel and the audit daemon. With this patch, the backlog may temporarily grow beyond the defined limits when the audit daemon is stopped and the system is under heavy audit pressure, but kauditd_thread() will continue to make progress and drain the queues as it would for other connection problems. For example, with the audit daemon put into a stopped state and the system configured to audit every syscall, it was still possible to shut down the system without a kernel panic, deadlock, etc.; granted, the system was slow to shut down, but that is to be expected given the extreme pressure of recording every syscall.

The timeout value of HZ/10 was chosen primarily through experimentation and this developer's "gut feeling". There is likely no one perfect value, but as this scenario is limited in scope (root privileges would be needed to send SIGSTOP to the audit daemon), it is likely not worth exposing this as a tunable at present. This can always be done at a later date if it proves necessary.

Cc: [email protected]
Fixes: 5b52330bbfe63 ("audit: fix auditd/kernel connection state tracking")
Reported-by: Gaosheng Cui <[email protected]>
Tested-by: Gaosheng Cui <[email protected]>
Reviewed-by: Richard Guy Briggs <[email protected]>
Signed-off-by: Paul Moore <[email protected]>

2021-12-15  audit: ensure userspace is penalized the same as the kernel when under pressure  (Paul Moore; 1 file, -1/+17)

Due to the audit control mutex necessary for serializing audit userspace messages, we haven't been able to block/penalize userspace processes that attempt to send audit records while the system is under audit pressure. The result is that privileged userspace applications have a priority boost with respect to audit, as they are not bound by the same audit queue throttling as the other tasks on the system.

This patch attempts to restore some balance to the system when under audit pressure by blocking these privileged userspace tasks after they have finished their audit processing, and dropped the audit control mutex, but before they return to userspace.

Reported-by: Gaosheng Cui <[email protected]>
Tested-by: Gaosheng Cui <[email protected]>
Reviewed-by: Richard Guy Briggs <[email protected]>
Signed-off-by: Paul Moore <[email protected]>

2021-12-14  bpf: Fix kernel address leakage in atomic cmpxchg's r0 aux reg  (Daniel Borkmann; 1 file, -1/+8)

The implementation of BPF_CMPXCHG on a high level has the following parameters:

  .-[old-val]                                .-[new-val]
  v                                          v
  BPF_R0 = cmpxchg{32,64}(DST_REG + insn->off, BPF_R0, SRC_REG)
                          `-[mem-loc]          `-[old-val]

Given a BPF insn can only have two registers (dst, src), R0 is fixed and used as an auxiliary register for input (old value) as well as output (returning the old value from the memory location).

While the verifier performs a number of safety checks, it fails to reject unprivileged programs where R0 contains a pointer as the old value. Through brute-forcing it takes about ~16sec on my machine to leak a kernel pointer with BPF_CMPXCHG. The PoC is basically probing for kernel addresses by storing the guessed address into the map slot as a scalar, and using the map value pointer as R0 while SRC_REG has a canary value to detect a matching address.

Fix it by checking R0 for pointers, and rejecting the program if that's the case for unprivileged programs.

Fixes: 5ffa25502b5a ("bpf: Add instructions for atomic_[cmp]xchg")
Reported-by: Ryota Shiga (Flatt Security)
Acked-by: Brendan Jackman <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

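The fix boils down to one extra verifier check; a hedged sketch (cf. the atomic handling in kernel/bpf/verifier.c, simplified):

  /* BPF_CMPXCHG uses R0 as the old-value input, so R0 must not hold
   * a pointer for unprivileged programs */
  if (insn->imm == BPF_CMPXCHG && is_pointer_value(env, BPF_REG_0)) {
          verbose(env, "R0 leaks addr into mem\n");
          return -EACCES;
  }
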
2021-12-14  bpf: Fix kernel address leakage in atomic fetch  (Daniel Borkmann; 1 file, -3/+9)

The change in commit 37086bfdc737 ("bpf: Propagate stack bounds to registers in atomics w/ BPF_FETCH") around check_mem_access() handling is buggy since it would allow unprivileged users to leak kernel pointers. For example, an atomic fetch/and with -1 on a stack destination which holds a spilled pointer will migrate the spilled register type into a scalar, which can then be exported out of the program (since scalar != pointer) by dumping it into a map value.

The original implementation of XADD was preventing this situation by using a double call to check_mem_access(), one with BPF_READ and a subsequent one with BPF_WRITE, in both cases passing -1 as a placeholder value instead of a register as per XADD semantics since it didn't contain a value fetch. The BPF_READ also included a check in check_stack_read_fixed_off() which rejects the program if the stack slot holds a pointer value (__is_pointer_value()) and dst_regno < 0. The latter is to distinguish whether we're dealing with a regular stack spill/fill or some arithmetical operation which is disallowed on non-scalars; see also 6e7e63cbb023 ("bpf: Forbid XADD on spilled pointers for unprivileged users") for more context on check_mem_access() and its handling of the placeholder value -1.

One minimally intrusive option to fix the leak is for the BPF_FETCH case to initially check the BPF_READ case via check_mem_access() with -1 as the register, followed by the actual load case with a non-negative load_reg to propagate stack bounds to registers.

Fixes: 37086bfdc737 ("bpf: Propagate stack bounds to registers in atomics w/ BPF_FETCH")
Reported-by: <[email protected]>
Acked-by: Brendan Jackman <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Signed-off-by: Alexei Starovoitov <[email protected]>

2021-12-14  audit: use struct_size() helper in kmalloc()  (Xiu Jianfeng; 3 files, -3/+3)

Make use of the struct_size() helper instead of an open-coded calculation.

Link: https://github.com/KSPP/linux/issues/160
Signed-off-by: Xiu Jianfeng <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
Signed-off-by: Paul Moore <[email protected]>

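For reference, the pattern being converted, as a hedged example (generic flex-array struct; 'buf' and 'entries' are placeholder names, not from the patch):

  /* before: open-coded size calculation, prone to overflow */
  buf = kmalloc(sizeof(*buf) + count * sizeof(buf->entries[0]),
                GFP_KERNEL);

  /* after: struct_size() from <linux/overflow.h> computes the same
   * size but saturates instead of wrapping on overflow */
  buf = kmalloc(struct_size(buf, entries, count), GFP_KERNEL);
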
2021-12-14  signal: Skip the altstack update when not needed  (Chang S. Bae; 1 file, -0/+9)

== Background ==

Support for large, "dynamic" fpstates was recently merged. This included code to ensure that sigaltstacks are sufficiently sized for these large states. A new lock was added to remove races between enabling large features and setting up sigaltstacks.

== Problem ==

The new lock (sigaltstack_lock()) is acquired in the sigreturn path before restoring the old sigaltstack. Unfortunately, contention on the new lock causes a measurable signal handling performance regression [1]. However, the common case is that no *changes* are made to the sigaltstack state at sigreturn.

== Solution ==

do_sigaltstack() acquires sigaltstack_lock() and is used for both sys_sigaltstack() and restoring the sigaltstack in sys_sigreturn(). Check for changes to the sigaltstack before taking the lock. If no changes were made, return before acquiring the lock. This removes lock contention from the common-case sigreturn path.

[1] https://lore.kernel.org/lkml/20211207012128.GA16074@xsang-OptiPlex-9020/

Fixes: 3aac3ebea08f ("x86/signal: Implement sigaltstack size validation")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

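A hedged sketch of the early-return shape (cf. kernel/signal.c; simplified, with t == current and ss_sp/ss_size/ss_flags being the incoming values):

  /* on the sigreturn path, skip sigaltstack_lock() entirely when the
   * incoming stack matches what is already installed */
  if (t->sas_ss_sp == (unsigned long)ss_sp &&
      t->sas_ss_size == ss_size &&
      t->sas_ss_flags == ss_flags)
          return 0;
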
2021-12-14  cgroup: return early if it is already on preloaded list  (Wei Yang; 1 file, -2/+2)

If a cset is already on the preloaded list, this means we have already set up this cset properly for migration. This patch just relocates the root cgrp lookup, which isn't used anyway when the cset is already on the preloaded list.

[[email protected]: rephrase the commit log]
Signed-off-by: Wei Yang <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

2021-12-14  arm64: Enable KCSAN  (Kefeng Wang; 1 file, -0/+1)

This patch enables KCSAN for arm64, with updates to the build rules to not use KCSAN for several incompatible compilation units.

Recent GCC versions (at least GCC 10) made outline-atomics the default option (unlike Clang), which causes linker errors for kernel/kcsan/core.o. Disable the out-of-line atomics with -mno-outline-atomics to fix the linker errors.

Meanwhile, as Mark said [1], some latent issues need to be fixed which aren't just a KCSAN problem, so we make KCSAN depend on EXPERT for now.

Tested selftest and kcsan_test (built with GCC 11 and Clang 13), and all passed.

[1] https://lkml.kernel.org/r/YadiUPpJ0gADbiHQ@FVFF77S0Q05N

Acked-by: Marco Elver <[email protected]> # kernel/kcsan
Tested-by: Joey Gouly <[email protected]>
Signed-off-by: Kefeng Wang <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
[[email protected]: added comment to justify EXPERT]
Signed-off-by: Catalin Marinas <[email protected]>

2021-12-14  exit/kthread: Fix the kerneldoc comment for kthread_complete_and_exit  (Eric W. Biederman; 1 file, -1/+1)

I misspelled kthread_complete_and_exit in the kerneldoc comment; fix it.

Link: https://lkml.kernel.org/r/[email protected]
Link: https://lkml.kernel.org/r/[email protected]
Reported-by: kernel test robot <[email protected]>
Fixes: cead18552660 ("exit: Rename complete_and_exit to kthread_complete_and_exit")
Signed-off-by: "Eric W. Biederman" <[email protected]>

2021-12-14  Merge branch 'irq/urgent' into irq/msi  (Thomas Gleixner; 6 files, -23/+32)

Pick up the PCI/MSI-x fixes.

Signed-off-by: Thomas Gleixner <[email protected]>

2021-12-14  perf: Add a counter for number of user access events in context  (Rob Herring; 1 file, -0/+4)

On arm64, user space counter access will be controlled differently compared to x86. On x86, access in the strictest mode is enabled for all tasks in an MM when any event is mmap'ed. For arm64, access is explicitly requested for an event and only enabled when the event's context is active. This avoids hooks into the arch context switch code and gives better control of when access is enabled.

In order to configure user space access when the PMU is enabled, it is necessary to know if any event (currently active or not) in the current context has user space access enabled. Add a counter, similar to the other counters in the context, to avoid walking the event list every time.

Reviewed-by: Mark Rutland <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Rob Herring <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Will Deacon <[email protected]>

2021-12-13  bpf: Let bpf_warn_invalid_xdp_action() report more info  (Paolo Abeni; 2 files, -4/+4)

In non-trivial scenarios, the action id alone is not sufficient to identify the program causing the warning. Before the previous patch, the generated stack trace pointed out at least the involved device driver.

Let's additionally include the program name and id, and the relevant device name. If additional info is needed, it can be fetched via a kernel probe, leveraging the arguments added here.

Signed-off-by: Paolo Abeni <[email protected]>
Signed-off-by: Daniel Borkmann <[email protected]>
Acked-by: Toke Høiland-Jørgensen <[email protected]>
Link: https://lore.kernel.org/bpf/ddb96bb975cbfddb1546cf5da60e77d5100b533c.1638189075.git.pabeni@redhat.com

2021-12-13  cgroup/cpuset: Don't let child cpusets restrict parent in default hierarchy  (Waiman Long; 1 file, -11/+3)

In validate_change(), there is a check, present since v2.6.12, to make sure that each of the child cpusets must be a subset of a parent cpuset. IOW, it allows child cpusets to restrict what changes can be made to a parent's "cpuset.cpus". This actually violates one of the core principles of the default hierarchy, where a cgroup higher up in the hierarchy should be able to change configuration however it sees fit, as delegation breaks down otherwise.

To address this issue, the check is now removed for the default hierarchy to free parent cpusets from being restricted by child cpusets. The check will still apply for the legacy hierarchy.

Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>

2021-12-13  exit/kthread: Move the exit code for kernel threads into struct kthread  (Eric W. Biederman; 1 file, -2/+5)

The exit code of kernel threads has different semantics than the exit_code of userspace tasks. To avoid confusion and allow the userspace implementation to change as needed, move the kernel thread exit code into struct kthread.

Signed-off-by: "Eric W. Biederman" <[email protected]>

2021-12-13  kthread: Ensure struct kthread is present for all kthreads  (Eric W. Biederman; 3 files, -25/+26)

Today the rules are a bit iffy and arbitrary about which kernel threads have struct kthread present. Both idle threads and threads started with create_kthread want struct kthread present, so that is effectively all kernel threads. Make the rule that if PF_KTHREAD is set and the task is running, then struct kthread is present. This will allow the kernel thread code to use tsk->exit_code with different semantics from ordinary processes.

To ensure that struct kthread is present for all kernel threads, move its allocation into copy_process. Add a deallocation of struct kthread in exec for processes that were kernel threads.

Move the allocation of struct kthread for the initial thread earlier so that it is not repeated for each additional idle thread.

Move the initialization of struct kthread into set_kthread_struct so that the structure is always and reliably initialized.

Clear set_child_tid in free_kthread_struct to ensure the kthread struct is reliably freed during exec. The function free_kthread_struct does not need to clear vfork_done during exec, as exec_mm_release called from exec_mmap has already cleared vfork_done.

Signed-off-by: "Eric W. Biederman" <[email protected]>

2021-12-13  exit: Rename complete_and_exit to kthread_complete_and_exit  (Eric W. Biederman; 2 files, -9/+21)

Update complete_and_exit to call kthread_exit instead of do_exit. Change the name to reflect this change in functionality. All of the users of complete_and_exit are causing the current kthread to exit, so this change makes it clear what is happening.

Move the implementation of kthread_complete_and_exit from kernel/exit.c to kernel/kthread.c. As this function is kthread-specific, it makes the most sense for it to live with the kthread functions.

There is no functional change.

Signed-off-by: "Eric W. Biederman" <[email protected]>

2021-12-13  exit: Rename module_put_and_exit to module_put_and_kthread_exit  (Eric W. Biederman; 1 file, -3/+3)

Update module_put_and_exit to call kthread_exit instead of do_exit. Change the name to reflect this change in functionality. All of the users of module_put_and_exit are causing the current kthread to exit, so this change makes it clear what is happening.

There is no functional change.

Signed-off-by: "Eric W. Biederman" <[email protected]>

2021-12-13  exit: Implement kthread_exit  (Eric W. Biederman; 1 file, -4/+19)

The way the per-task_struct exit_code is used by kernel threads is not quite compatible with how it is used by userspace applications. The low byte of the userspace exit_code value encodes the exit signal, while kthreads just use the value as an int holding an ordinary kernel function exit status like -EPERM.

Add kthread_exit to clearly separate the two kinds of uses.

Signed-off-by: "Eric W. Biederman" <[email protected]>

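A hedged sketch of the new entry point (cf. kernel/kthread.c; initially a thin wrapper around do_exit(), but it gives kthreads their own exit semantics going forward):

  /* kernel threads exit with a plain kernel status like -EPERM, not a
   * userspace-style encoded exit_code */
  void __noreturn kthread_exit(long result)
  {
          do_exit(result);
  }
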
2021-12-13  exit: Stop exporting do_exit  (Eric W. Biederman; 1 file, -1/+0)

Now that there are no more modular uses of do_exit, remove the EXPORT_SYMBOL.

Suggested-by: Christoph Hellwig <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>

2021-12-13  exit: Stop poorly open coding do_task_dead in make_task_dead  (Eric W. Biederman; 1 file, -2/+1)

When the kernel detects it is oopsing, or otherwise force-killing a task while it exits, the code poorly attempts to permanently stop the task from scheduling. I say poorly because it is possible for a task in TASK_UNINTERRUPTIBLE to be woken up.

As it makes no sense for the task to continue, call do_task_dead instead, which actually does the work and permanently removes the task from the scheduler, guaranteeing the task will never be woken up again.

Signed-off-by: "Eric W. Biederman" <[email protected]>